AI DOOM: Jesse Hoogland of Timaeus
Jesse Hoogland: Doom is this ineffable bad feeling you get around what might happen with AI, right?
So it spans everything from extinction to maybe only the Amish remain, or some small population of humans remains, but we stay on this planet and at some point we go extinct later. Somehow we were unable to grow up into our full potential as a species, potentially a spacefaring civilization, and that never happens. And I think my p(doom) is like 10 to 90%.
Steve Hsu: Welcome to Manifold. We're here in Berkeley, California at FAR Labs, a facility that I'm told can accommodate 120 researchers at a time, all of whom work on AI safety. Our guest today is Jesse Hoogland, a leading researcher in AI safety whom we are also filming for our documentary project, Dreamers and Doomers. Jesse, welcome to the show.
Jesse Hoogland: Thank you for having me.
Steve Hsu: I was reviewing your biography and I noticed that you are an American who studied in Amsterdam, where you earned bachelor's and master's degrees in theoretical physics and computer science. You are interested in what I would call the statistical mechanics of things like Boltzmann machines and neural networks.
So tell us what you were like in high school and how you ended up in Amsterdam.
Jesse Hoogland: So, I was actually born in Amsterdam. I'm originally Dutch. I spent my first few years in the Netherlands and then moved to the US and spent the rest of my childhood there. I think it comes as no surprise that as a high schooler, I was a bit of a nerd.
You know, I always appreciated math and science and thought it inevitable that I'd end up in either AI or physics, and sort of decided at the last minute which one to commit to. So I did a bachelor's in general science and then a master's in theoretical physics. Really, already at that point, I couldn't make up my mind whether I wanted to go down the AI route or the physics route.
I figured, and this is probably familiar to you as a physicist, that physics gives you more optionality; it certainly gives you more confidence. And so I went with that route.
Yeah. And then, over the years, I think what's happened is that what was initially mainly just curiosity about how artificial intelligence works slowly got replaced by a sense of dread.
I felt this rising sea of maybe panic at what these systems were capable of and how quickly things were changing, and the problems that had been considered long open, for example Go, famously, even language processing, suddenly started to fall like dominoes. And it's against that backdrop that I decided after the master's to switch to AI safety.
Steve Hsu: So what year did you finish your master's?
Jesse Hoogland: In 2021, so in the middle of the pandemic. Yeah, at the time I hadn't yet fully decided to commit to AI, so I actually spent a year working on a health tech startup, and then at some point I realized, I don't care about this; this isn't the right thing for me. So I went back to the drawing board and decided at that point to commit myself to AI safety. So I jumped in, I did MATS 3, which is a fellowship in AI safety, and that was really the turning point to start working on AI safety full-time.
Steve Hsu: Got it. So I'd like to explore your MATS experience in a little more detail.
Mm-hmm. Before doing that, I just want to note that I looked at the titles, I think, for both your bachelor's and master's dissertations. Mm-hmm. And I think one was maybe Boltzmann machines and the other one was maybe neural nets. But in both cases, you were looking at the physics or the statistical mechanics of those types of learning engines.
And so you were already thinking along these lines. It was pretty clear that your physics interest was focused on things related to machine learning and AI. So maybe just talk a little bit about that.
Jesse Hoogland: Yeah. It is with maybe typical physicist's arrogance that I thought physics was the natural toolkit with which to study natural phenomena, not just, you know, magnets, but also the mind.
This has long been clear in AI. A lot of the godfathers of AI, a lot of the people who really founded the field, come from a physics background. So restricted Boltzmann machines are an idea that descends really from the Ising model of ferromagnetism. So there's a direct link there, right?
And maybe we don't really use RBMs nowadays, but that lineage is not something that, you know, I'm the first to observe.
Steve Hsu: Was that something your professors at Amsterdam encouraged, or was it something where you went to them thinking, hey, I'm actually interested in machine learning, but I'm interested in studying it from a more theoretical physics perspective?
Was that you or your teachers?
Jesse Hoogland: It was more from me. Yeah, I came to the professors. So the bachelor's thesis was with Marco, who was a string theorist who was quite new to all this stuff, but familiar with renormalization and that kind of content. And so, yeah, he was also very interested in learning more.
Steve Hsu: And I think this is generally true: you go to professors at universities and they're excited to learn something new. Good. So you mentioned you participated in something called MATS. Can you explain what MATS is?
Jesse Hoogland: MATS is, I believe it stands for the ML Alignment Theory Scholars Program.
So this is a two-to-three-month program in which you're paired with a mentor in the AI safety community, and you do some research that often leads to a paper publication. There's an extension program where you can continue to work on this longer, and I think it continues to be the best place to go if you are trying to break into the AI safety technical research community.
And I think that was already true back then; I did the third cohort. So this was still pretty early days. And I think almost everybody who participated then is still in the field in some capacity. Many people have started organizations. So, for example, I did MATS alongside Marius Hobbhahn,
who went on to found Apollo Research. We were stuck in the same tiny office, which was drafty and cold, you know, sharing in the struggle, and then we went out and founded our own research organizations.
Steve Hsu: Now, this MATS that you attended, was it held at Lighthaven?

Jesse Hoogland: Not at the time. The venue didn't exist yet. It was moved to Lighthaven later.
Steve Hsu: Okay. We'll be showing in the documentary lots of footage of Lighthaven, which has matured into a sort of center for this kind of thinking. Although I guess I just realized today that FAR, where we are now, accommodates way more people, probably, or at least a similar number of people,
Mm-hmm. to what Lighthaven can accommodate. So, at that time when you signed up for MATS, you were thinking, correct me if I'm wrong, I'm just hypothesizing, you were thinking AI is definitely going to be a thing sooner rather than later. And were you thinking about existential risk when you joined MATS?
Was that, was that your main motivating factor? Yeah, go ahead.
Jesse Hoogland: Yeah. I had read Superintelligence back in 2015, right? So already then I was aware of the doomer arguments and thought, yeah, this seems important, somebody should work on this, but fortunately it's still a ways away. And then I think by the time we started to see the first Copilot kinds of agents, that's when it started to dawn on me that maybe there wasn't that much time until we see transformative AI.
And so, you know, I might as well work on it myself.
Steve Hsu: Good. So, a big step for you, right? Were you still living in Europe when you did your health tech startup?

Jesse Hoogland: We were in Europe, in Brazil, all over the place. Nomadic. So yeah, there was a period of two years where my wife and I were sort of just traveling around the world.
Steve Hsu: and it was via the internet I assume that you learned about MATS. Were you a regular reader of the site LessWrong?
Jesse Hoogland: I was a regular reader. It continues to be the, you know, the main place that AI safety is, is, is discussed and, and communicated to the, to the wired wider world.
Steve Hsu: Aside from Nick Bostrom, whom you mentioned, the author of Superintelligence, who would you say were the thought leaders that most influenced your ideas in this area at that time?
Jesse Hoogland: I mean, I think the obvious other answer is Eliezer. His writing on AI safety dates back almost 20 years now, in some cases even further. So obviously his writing, Paul Christiano's writing, Evan Hubinger's,
and those are probably my main influences.
Steve Hsu: Got it. Now I actually want to ask you to address the audience, and remember the audience may be very normie, so it may be some very boring 60-year-old theoretical chemistry professor or someone who writes compilers,
people who might be quite skeptical that there's actually a problem. And so I would like you to summarize the problem, or motivate the problem, for that kind of person. Along the way, maybe you could comment on whether this is how the younger you thought, and whether you've updated your views a little bit.
You could add a little commentary like that. But the main purpose is to give the argument so that a quote-unquote normal person, but an intelligent normal person, can appreciate why someone like you changed your entire life, moved across the planet, and is now living here in Berkeley in order to pursue this problem.
Jesse Hoogland: There are several arguments I can give. I think the safest one is that things seem to be moving quite quickly, and that introduces a lot of uncertainty into whatever happens next. So, in the physics lingo, we're entering a soft mode, where tiny perturbations can cause massive downstream ramifications, and we just don't fully understand what's about to happen.
So, you know, we're sort of introducing a burning matchstick into a house built of timber. We might come out the other end fine, but there's just a lot of possible downside if we get this wrong. In order to believe that, you need to believe that it could come quickly, and I think this is the biggest crux, the biggest disagreement. If you look outside of the community, most people are correctly anchoring on most historical technological progress being much slower, taking much more time, than the most zealous promoters of that technology were arguing at the time. And the same is possible for AI: things could take quite a lot longer than the most enthusiastic doomers or dreamers are saying. But if you look at how the frontier is moving, I think you do have to put some probability on our being less than five years away from transformative AI that's smarter than most humans, that can do at least any cognitive labor that anyone can. And I think the strongest evidence for this comes from METR and their task-length doubling curves, where they study the longest task that an AI can accomplish autonomously.
So you come up with a bunch of software engineering tasks, or things similar to this, and then you benchmark how long it takes a human to accomplish these tasks. And then, based on that, you can estimate, at a 50% success probability, the length of the equivalent human task that an AI can do.
And if you plot these curves as a function of time, you'll find that this increases incredibly steadily. It follows this beautiful exponential in time, just a straight line on a log plot. Now, again, you have to be very cautious with straight lines on log plots, because they do end at some point,
and things can peter out. But I think it's safe to assume that we'll get at least a few more doublings, just like it was safe for decades to extrapolate Moore's law to a few more doublings, and that, you know, in the next four years we'll see AI systems that can do tasks that would take a human an entire workday, probably even an entire work week.
And that's starting to look pretty crazy. And so again, if you buy that things are going to go very quickly, then I think the main argument for risk, and you don't need to buy much else, is just that this is sure to destabilize the current world we're in. And that destabilization is a way for a lot of risk to creep in, whether from humans using it in some malicious way, from the AI systems themselves acting in a malicious way, or from some sort of decentralized tragedy of the commons, where nobody is doing anything wrong individually, but somehow the composite system leads to terrible outcomes for everybody. Yeah, that's the safest argument, I think.
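The doubling extrapolation Jesse describes can be sketched in a few lines. The starting task length and doubling time below are illustrative assumptions, not METR's published estimates.

```python
# Illustrative sketch of the task-length doubling extrapolation.
# The current task horizon (1 hour) and doubling time (6 months)
# are assumptions for illustration, not METR's published figures.

def task_length_hours(years_from_now, current_hours=1.0, doubling_years=0.5):
    """Task length an AI completes at 50% success, assuming steady doubling."""
    return current_hours * 2 ** (years_from_now / doubling_years)

# Under these assumptions, four years of steady doubling takes a
# 1-hour task horizon to 2^8 = 256 hours, i.e. weeks of human work.
for years in (0, 1, 2, 4):
    print(f"{years} years out: {task_length_hours(years):.0f} hours")
```

The caveat in the conversation applies directly: the function is only meaningful for the few doublings over which the trend actually holds.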
Steve Hsu: Let me recapitulate for you. You're really making a kind of prediction of timelines for AIs to reach the point where their capability is really a kind of order-one or larger effect on humans. Mm-hmm. And I would guess, well, there are some people who are really skeptical, who just assume everything that comes out of Sam Altman's mouth is a lie or something, including about the economic value of what he's building.
Mm-hmm. But I would think most people appreciate how fast things have changed in the last few years. And they do think it's reasonable. There might be a factor of two in the timeline between you and them, but they see it as very plausible that we're going to reach a point where AI is affecting all aspects of human life.
I think maybe we'd like to go further and ask where does the existential risk argument come from? Mm-hmm. Maybe you can elaborate on that.
Jesse Hoogland: Yeah, people define existential risk in different ways. So there's existential risk in the sense that we just fail to live up to our full potential as a species.
So maybe we stay around, but somehow we lose control, and, you know, most decisions in the future are exercised by AI systems that at some point become more capable than us. And so we're just a second species, like orangutans or chimpanzees.
Then there's existential in the sense that everybody dies. This one usually routes through: we get a very powerful AI that somehow wants something other than what humans want and ends up seeing us as a hindrance, something in the way. It exterminates us not out of any kind of malice, but just because we are an obstacle, and because the resources that we consume and are composed of could be used for some other end, it decides to repurpose those resources.
I think the main argument for this is also an argument from uncertainty: if we get very transformative progress very quickly, we don't know what the motivation systems of the resulting AIs are going to be. That's because we don't directly control the motivations that AI systems have; they're sort of grown from the amorphous blob that is the entire internet, right? And, well, I don't know about you, but I've been on the internet.
Steve Hsu: right?
Jesse Hoogland: There's some stuff on the internet that I don't know if I want in my frontier models, especially if they're going to be 10 times as intelligent as anyone else. Because we lack a scientific understanding of how goals and drives are put into these systems, I think we can't rule out these more extreme scenarios where humans are wiped out. Let me say a little bit more about why I think it's hard to put goals and values into AI systems.
There's long been a distinction made between capabilities and goals or values, originally under the heading of the orthogonality thesis: the idea that any level of capability can be combined with any terminal goal. I think a strong version of orthogonality is not true, because the motivations you end up with depend on the same training process as the capabilities.
So in practice, we won't get these things to be completely uncorrelated. But our understanding of how this process works, of how the models acquire capabilities and values from the data we train them on, is still really elementary, and not developed enough that we could make any kind of guarantee that the systems we end up with will have the same goals as humans do.
And so the problem is that artificial intelligence systems are grown; they're learned implicitly from data, and there are many different ways to learn from data that are compatible with the same behavior. So when you deploy such a system in the real world, on data it hasn't seen before, it's underconstrained.
You cannot guarantee that it's going to continue generalizing. I think that's really where most of the risk comes from for these more intense rogue AI scenarios.
Steve Hsu: Good. So I think you've given a very defensible argument for why there is some risk, and it could even be existential risk, depending on how one defines it and what scenario we end up in.
I think the arguments you gave are very rational, as one would expect, and very scientific, but I want to explore the psychological or motivational aspect a little bit. So you're still a young person. Does thinking through these arguments affect you emotionally? Were you emotionally actually worried for the future of your family and humanity? And you said, hey, helping people's health through a tech startup
Jesse Hoogland: mm-hmm.
Steve Hsu: is really nothing compared to possibly saving us from annihilation by superintelligences. Did you feel that in an emotional way? Or, one could imagine, in the old Star Trek universe, Commander Spock not actually feeling any emotion, being very cold, as humans would say, cold-blooded about it, but just deciding, hey, I'd better work on problem X, not Y, because X affects more people's utility.
But I'm wondering, first of all, I would like to hear your answer, but then also maybe what you think the distribution of answers is for other people in this building. Are you feeling this at an emotional level, or is it really just a sort of cold-blooded calculus of utility that has you here?
Jesse Hoogland: I think at the start it really was. I decided to make the shift because at some point the rising sea of dread crossed a threshold.

Steve Hsu: Your dread.

Jesse Hoogland: Yeah. My own sense

Steve Hsu: of... So talk about your dread. Like, it's late at night. You should be doing a quantum mechanics problem set, but you're actually reading LessWrong.
Jesse Hoogland: Mm-hmm.
Steve Hsu: And someone says something about a basilisk, or something that could go wrong, or just imagine the AI makes a little cage and puts a simulacrum of you into that cage and tortures it for a trillion years. What is the thing that pushed your dread over the threshold? If you can remember, or just give us an example.

Jesse Hoogland: I think a lot of the dread comes from the fact that it seems the future is being collapsed onto this one axis.
Steve Hsu: Meaning that the impact of AI will determine everything. Yes.

Jesse Hoogland: Yeah, if this weren't the case, then you'd have quite a lot of agency to do what you want with the future. But if we're getting transformative AI, and maybe soon, then any track you follow is going to reach a terminus pretty soon. And so your decisions are sort of being made for you at that point.
And I think the collapse of all the free will I have onto one axis, and the uncertainty of how much longer we have, is kind of the thing that did it for me. But then the interesting thing is that as soon as I made the shift and started working on it, it's sort of like I clock in and clock out and don't take the dread home with me.
I mean, so now
Steve Hsu: Now you don't take the dread home.

Jesse Hoogland: Most days, I don't think I take the dread home. You know, I do my part, I work on it. Obviously you can always work harder, but that's a recipe for burnout. So I think I do enough, or at least I've justified it in my own head that I don't need to continue worrying about it.
Steve Hsu: I want to dwell on this a little more. Your background is in physics, and so is mine. It's quite common for a professional physicist or mathematician, especially in an academic environment where they get to choose what they work on, to say, oh, should I work on this, or maybe this new paper from Princeton has an interesting breakthrough and I should start allocating some of my thinking time to this other problem. But it's very cold-blooded. It's sort of, oh, there are a bunch of things I could work on, I have a portfolio of projects,
and this one's getting more important, so I'm going to put more energy into that particular line of research.
Jesse Hoogland: Mm-hmm.
Steve Hsu: But I'm guessing that wasn't exactly how you felt, and I'm guessing that's not exactly how the people around here feel. So I'm trying to get at the actual dread. You brought the word dread into the conversation. Help us understand the dread.

Jesse Hoogland: Okay. Maybe there is a more positive emotion involved, which is just curiosity. So if dread is the thing that finally tips the bucket over and means you're committed to this new direction, I think once you've made the decision to enter this new field of AI safety, most of my subsequent orienting was really motivated by curiosity, following the threads of research within that community that most resonated. And that led to the research I'm now doing at Timaeus, which is really on applications of statistical physics to interpretability and understanding. So I think that's kind of enough, right? Maybe you need dread to get in the door; past that point, curiosity is enough to find you a niche in which you're competitive.
Steve Hsu: Got it. Just to clarify for the audience: Timaeus is the organization you founded. FAR is a kind of shared workspace environment where we are now, but you're the founder of Timaeus.

Jesse Hoogland: That's right.

Steve Hsu: Correct. Timaeus itself has something like 10 or 20 people in it.
Jesse Hoogland: Yeah.
Steve Hsu: So just say a little bit about Timaeus.
Jesse Hoogland: Timaeus was founded two and a half years ago, on a mission of making breakthrough progress in AI safety by drawing on the kind of understanding we have in mathematics and the sciences. In particular, we've been pursuing a research agenda that was started by Daniel Murfet, who at the time was a lecturer at the University of Melbourne.
It tries to apply a particular field called singular learning theory to AI safety, to the problem of interpretability. That field, SLT, draws a lot on statistical physics, in particular thermodynamics, and uses the same kind of framework of studying how a system responds to external perturbations to learn something about its internals and how the system generalizes.
So the math is the same, right? The notation is the same, the names are the same. And that's really what drew me into this after I was introduced to it by one of my co-founders: thinking, you know, this is a direction that's interesting for me to pursue, because it's where I have some relative specialization compared to other people in the field.
Steve Hsu: Got it. For listeners who have a little bit of a math or physics background: you have something called a loss surface. Typically when you're training these models, there's some optimization you're trying to conduct, and you're trying to find a minimum, or at least approach the minimum, of some high-dimensional function, some kind of loss function.
And of course there are tremendous analogies between that kind of process and lots of problems in physics, like minimizing the energy or free energy of a system. So the notation and even the methods can be quite similar. Here, you're trying to really understand, in a way, what the model is thinking, or motivated by, based on some aspects of the loss surface. I think there are a lot of people who realize that, you know, 2% of US GDP in 2026 is slated to be spent on data centers and model training and model inference, something that's unprecedented.
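The loss-surface analogy Steve sketches can be illustrated with a toy gradient descent. The quadratic loss below is an illustrative stand-in for a real training objective, which is vastly higher-dimensional and non-convex.

```python
import numpy as np

# Toy illustration of the loss-surface picture: gradient descent on a
# simple quadratic "energy" surface with its minimum at w = [3, 3].
# Real training objectives are far higher-dimensional and non-convex;
# this is only the cartoon version of the analogy.

def loss(w):
    return float(np.sum((w - 3.0) ** 2))

def grad(w):
    return 2.0 * (w - 3.0)

w = np.zeros(2)          # start at the origin
for _ in range(100):
    w -= 0.1 * grad(w)   # step downhill along the negative gradient

print(w, loss(w))        # w approaches [3, 3]; loss approaches 0
```

The physics analogy is that the update rule plays the role of a relaxation process driving the system toward a low-energy configuration.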
That's actually slightly exceeding what we spent on railroads when we built out the rail. There was a railroad mania in the 19th century that peaked into a huge bubble, which then collapsed, but it left us with lots of good rail infrastructure in the U.S. The only thing I can think of that actually exceeds this, and I checked this myself,
the only thing I can think of that exceeds this level of GDP expenditure for a single thing, was the nuclear weapons build-out at the beginning of the Cold War, in the early fifties. We might have hit 3 or 4% of GDP on something, again, done by physicists building nuclear weapons. But still, we're already at 2%.
I think most people are aware of this. They're buying Nvidia stock, or deciding whether to buy Nvidia stock. I don't think many people understand that in this ecosystem there are maybe a thousand, or at least a few hundred, safety researchers like yourself. Mm-hmm. Tell us a little bit about what it's like to lead an organization like Timaeus.
Who funds it?
I doubt it's venture capitalists funding it, or maybe it is. Mm-hmm. How do you raise the money? What does that ecosystem look like?
Jesse Hoogland: I think it's worth going back in time to the founding of the field. So, in the early aughts, it was primarily Eliezer Yudkowsky who realized, hmm, transformative AI might be coming soon, and maybe we're not ready to deal with it when it gets here.
And, you know, at the time it wasn't clear yet what technology this AI would be built on top of; the deep learning revolution didn't get started until 2011 or thereabouts. So how do you work on safeguarding a future AI system if you don't yet know what it's made of? The answer they came up with is: you start at the end.
You try to come up with idealized models of what a superintelligence would look like, and then come up with ways to make that system safe. And then, in typical physicist fashion, maybe you perturb out from that towards where you are right now. And so there are models of superintelligence, things like AIXI, which are not physically implementable. You can't actually implement this in our universe, but you can use it to study the properties of superintelligent systems and then try to make statements about what they would do in the real world. Then at some point, you know, 2011 comes around, and it starts to become clear that deep learning is most likely the technology that's going to lead to AI.
In particular, you have these vast neural networks composed of billions and billions of numbers that you individually tweak in order to get their predictions, their behavior, closer and closer to what you want. So you grow this system step by step. You don't understand the internals; you can't directly read the programs these systems learn.
But at the end of it, you get something that looks a lot like an agent that can reason and plan ahead and do all the things a superintelligent system does. And so, as this is coming online, we see the development of a new pillar of AI safety, which you might call prosaic AI safety. It's really associated, I think, mostly with Paul Christiano, in around 2015 to 2020.
So we have these two different poles. One is more theoretical, starting at the end and working backwards, and the other pole is starting with current systems and projecting forward. And this, to this day, I think, is sort of the main axis along which to understand the field of AI safety. The more theoretical people are typically more doomy, because if you start with the most intense systems, the ones that are completely superintelligent, even a small difference in values is likely to be catastrophic; somehow the amount of capability in the system multiplies the divergence in values. If you look at the other extreme, the more empirical side focusing on current-day systems, they're typically more optimistic, because you can use a sort of argument from induction: current-day systems don't look particularly bad.
In fact, it seems like we're getting better at making them safer, generation by generation. At least, if you look at the Anthropic marketing, right, Opus 4.6 is already much safer than 4.5, which was much safer than Sonnet 4.5. So, you know, the crux is really what happens in the middle: how quickly do we go from being in the first regime to being in the latter regime?
Right now, we think these details of neural networks really matter. But if there's a fast transition from current-day levels of intelligence to far beyond human intelligence, then maybe the original camp is actually closer to thinking about things in the right way, and you want to use this forecasting kind of perspective.
Whereas if the transition is slow, if it takes two decades or longer, maybe we have quite a lot of time to adapt and respond, step by step.
Steve Hsu: Didn't that first group, the MIRI side, whom I guess I've known for a long time, and I've actually had these arguments with them over time,
I told them, you can't succeed at this. And I thought at one point they actually admitted they can't succeed, or did I misunderstand?
Jesse Hoogland: I don't think that's entirely correct. I think MIRI concluded that on the timeline we seem to have available, this kind of more theoretical research probably wouldn't pan out. I think if you talk to most folks there, they will believe that there is something like a textbook from the future on how to align AI, that if we had all of the understanding we'll have a hundred years from now, you could condense the understanding needed into a book.
And I tend to agree with that kind of perspective. I think something like that would be possible. The main difficulty here, at least according to them, is that unlike other scientific pursuits, the situation we find ourselves in with AI might be the kind of case where you can only be wrong once.
Right? Even making, making a mistake once if not to cause catastrophic consequences . Again, the other party, the prosaic AI safety community says we've already had a bunch of chances. We've already seen failures of current day systems, and we've tweaked and and updated them, and it sure seems like we have time each generation to get better.
So, to me, I find myself somewhere in between these two camps. I think both have points. I think the near-term thinkers tend to underestimate the discrete jumps in capabilities that might arise from future sudden algorithmic breakthroughs, and that the forecasting side tends to underestimate the importance of the details of the deep learning process and the growing level of understanding we have about what's going on in learning.
Steve Hsu: Good. I also wanted, though, to get into what the actual ecosystem looks like. Mm-hmm. So how do you keep the lights on at Timaeus? Who's funding this building, FAR? What is the breakdown between safety researchers who work at an independent entity like yours
versus people who work at the big labs? And how many are there of each type, et cetera. So maybe you could sketch out what your world looks like.
Jesse Hoogland: There are two main large funders. One of them is Open Philanthropy, now Coefficient Giving, and the other is the Survival and Flourishing Fund.
There are a bunch of other smaller organizations that are coming online. The field as a whole is starting to diversify. So whereas Coefficient Giving was for a long time the main and only funder, really, now we're seeing funding coming in through things like Schmidt Sciences, through Longview Philanthropy, through the UK AISI, another government program, so government grants that are now coming online, and ARIA, also in the UK. So, you know, the field has been diversifying a bit, but it has been primarily funded by just these two organizations.
Steve Hsu: When it comes to philanthropic giving, what would you say is roughly the total amount of money flowing into this activity per year?
Ballpark, within a factor of two.
Jesse Hoogland: On the order of half a billion dollars a year, all together.
Steve Hsu: So that that pays a lot of salaries, right?
Jesse Hoogland: Yeah. This is if you include technical AI safety research and also non-technical AI safety research.
Steve Hsu: Okay. And when you give that number, are you including what the labs are spending? Is it on the order of a billion, or,
Jesse Hoogland: Half a billion, or a billion including what the labs are spending, if you're including
Steve Hsu: labs. Okay. Are the two roughly comparable? So the philanthropic sources are roughly comparable to what the labs are spending, or,
Jesse Hoogland: No, probably at this point the labs are the main employers.
Steve Hsu: Okay.
Jesse Hoogland: With the exception being the UK AI Security Institute.
Steve Hsu: Okay.
Jesse Hoogland: So they employ on the order of 200 people. It's also one of the larger AI safety organizations.
Steve Hsu: Okay. And would you say the number of full-time researchers with qualifications on the same kind of scale as what you have, advanced degrees and mathematical or programming sophistication.
Would you say there are more than a thousand full-time safety researchers?
Jesse Hoogland: It's on the order of a thousand.
Steve Hsu: Okay.
Jesse Hoogland: Probably. I think when I started it was a few hundred, and now it's probably a thousand to 2,000.
Steve Hsu: Got it. Yeah. I'm curious: if I went to one of the research leads, so not a figurehead, but someone who really does the pre-training, say, or optimization of model architecture, maybe a former guest from my podcast like John Schulman or somebody like this. In their most cynical, unguarded moment, what would they say about the AI safety people that are in this community?
Jesse Hoogland: Oh, I see.
Steve Hsu: Would they say, yeah, I'm glad these guys are here, but I don't think they're gonna accomplish much?
Jesse Hoogland: Well, I think the answer is that the AI safety community has already accomplished much, to the chagrin of the AI safety community. So, for example, reinforcement learning from human feedback is an output of the AI safety community.
And this is one of the main breakthroughs that was necessary in order to go from these models that were mere predictors to the kind of chat model that you can actually interact with and get a helpful interaction. And so that's an example of, of a technique which almost directly stems out of the AI safety community.
A lot of people in the AI safety community regret this because they see this as more of a capabilities advance
Steve Hsu: Yes.
Jesse Hoogland: than a safety advance. I think more things like this are possible, both to the good and to the bad.
Steve Hsu: The reason I ask the question that way is I have heard from people who are, as you would say, a hundred percent in the capabilities domain.
They're just trying to get to AGI or ASI, and many of them have expressed to me the view that, hey, we're not gonna solve this safety problem. Mm-hmm. We're gonna get to AGI slash ASI, however you define it.
Jesse Hoogland: Mm-hmm.
Steve Hsu: Yeah. And so then they say, well, I'm glad these guys are trying, but we're probably not gonna solve the problem.
Jesse Hoogland: I think that is one of the main arguments for risk: we seem to find ourselves in a situation where it is easy to push on capabilities. At least, we've discovered that the main way to push on capabilities is to just invest more compute, more energy, more tokens. Yes. And everything else is secondary. We haven't found something that has this crisp kind of relationship for alignment.
At least, maybe some people argue that there are alignment scaling laws. What I think a lot of people in capabilities underestimate is that the value of understanding things deeply grows as a field becomes more mature. I think this is a trend that is actually common across most scientific revolutions. So, for example, if you look at the history of the Industrial Revolution, this was a revolution that was started by tinkerers.
People who played around with engines initially did not have the theory of thermodynamics and then statistical mechanics to help guide their research, but at some point you develop enough engines and you start to realize that there are a few simple tricks to improving engine efficiency. If you can increase the temperature differential between the heat source and the cold sink, well, that leads to an increase in efficiency.
If you can increase the pressure, then you can sustain a higher temperature at the heat source, so you can increase this further. And so suddenly the engineering of better engines gets simplified into quite a simple practice. A similar thing has happened for AI, where initially people were tinkering around with different architectures and didn't know quite what worked, and at some point they stumbled across these scaling laws that say the only thing that really matters is how much data you put into it and how long you train it for, the equivalent of pressure and temperature for the heat engine.
If you go back to the thermodynamic case in the Industrial Revolution, at some point, in order to push further to new efficiencies, you do need theoretical understanding, in order to design stronger materials or to design better fuel catalysts to burn your fuel more efficiently.
So there's a transition that takes place where the empirical field at some point needs to borrow from the theoretical understanding. And I think something similar is starting to take place, or is going to take place, with AI as well. As the systems themselves become much more expensive to design, it becomes harder to just find new territory through simple iteration.
You want theory to somehow guide the search, tell you where to look, and restrict the search space so that you can do this more efficiently, and also so you can avoid the catastrophic outcomes, the, you know, areas where things go wrong.
I think it's quite likely that this empirical bias that comes with any new scientific revolution will start to lose weight as we go further into the development of AI.
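The scaling-law picture Jesse gestures at here can be made concrete with the parametric loss fit published in the Chinchilla paper (Hoffmann et al., 2022). The constants below are that paper's reported fit, and the snippet is purely an editorial illustration of the "only two knobs matter" point, not a claim about any particular lab's models:

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted training loss as a function of parameter count N and tokens D.

    Constants are the published Chinchilla fit: an irreducible loss E plus
    power-law terms in model size and data, the 'pressure and temperature'
    of the analogy above.
    """
    E, A, B = 1.69, 406.4, 410.7   # irreducible loss + fitted coefficients
    alpha, beta = 0.34, 0.28       # fitted exponents for params and tokens
    return E + A / n_params**alpha + B / n_tokens**beta

# A smaller and a larger training run; only N and D enter the prediction.
small = chinchilla_loss(1e9, 20e9)     # ~1B params, ~20B tokens
large = chinchilla_loss(70e9, 1.4e12)  # ~70B params, ~1.4T tokens
```

Everything else about the architecture drops out of the fit, which is exactly the simplification Jesse compares to the thermodynamic efficiency tricks.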
Steve Hsu: Yeah, I think you've just articulated a case for deeper understanding, although it could just as well be said in service of capability.
Mm-hmm. So if you just said, I don't care about safety, I don't care about alignment, but I just want the models to be super goddamn capable. Mm-hmm. Of course it's useful to have a more theoretical understanding of how to optimize the steam engine or how to optimize the scaling, et cetera. But I think you're making a different point, which is that along the safety direction as well, you want to have this kind of theoretical understanding?
Jesse Hoogland: Yeah. I think there's a separate argument to the one I'm making, which is that I think on the margin, advancing our understanding of these systems advances safety faster than capabilities. At least, it offers us more control over these systems.
And so I think that level of understanding might be necessary in order to come up with a science of safety engineering for AI, but it might not be sufficient. So it's also unclear to me.
Steve Hsu: So I was in the UK a year ago, at a war game event which was held at the Tate Modern, and some of the sponsors included the UK government, I think the UK AISI you mentioned.
Jesse Hoogland: Mm-hmm.
Steve Hsu: Yeah. They might have also been co-sponsors of this thing. It was quite an elaborate event.
Jesse Hoogland: Mm-hmm.
Steve Hsu: And I think there were probably more than a hundred people involved in this war game. The successful conclusion of the game dynamics was that, at roughly the moment when self-improving AI came on the scene, so a lab in the US and a lab in China announced that they had self-improving AIs, or AI researchers that were autonomous, both sides negotiated a pause and mutual inspections of labs. And that was regarded as a success. You know, the game didn't have to work out that way. We could have just gone to war with each other.
Actually, I was leading the Chinese side, and the Americans were gonna nuke us. Quite amazingly, I only found this out after the game: if we had not agreed to the treaty, they were gonna launch all their nuclear weapons at us. Whereas we said we wanted to negotiate with the Americans, but we were not gonna nuke them if they didn't agree.
Right. So, pretty interesting. That was regarded as maybe the best possible outcome, in the sense that there was a feeling that safety would take more time than we have in the face of unrestrained development of AI. How do you feel about that? Do you think that without a long pause, without a Butlerian Jihad or something giving humans some breathing room, your team can save the day? Or do you think that a pause will have to be part of the answer?
Jesse Hoogland: I think most of the uncertainty is about what happens when you get the self-improvement loop online, if you get a self-improvement loop online. I think arguably we're already seeing the seeds of recursive self-improvement, if you ask the labs, right?
Anthropic said that the new Claude Opus models are starting to actually meaningfully accelerate their researchers and engineers. And we've already seen, for the last few generations of models, that they were using the previous generations of models in order to curate the training data curriculum for the next generation of models.
So in all these kinds of tiny ways, we're seeing the loop already closing in on itself. If that kicks off further and we really get something like an AI scientist, where we can run a million copies of the AI scientist in parallel, running thousands, millions of different experiments at the same time,
then maybe we can get quite a lot of research done in a very short window of time, and maybe that's where most of the alignment and safety research that will ever take place takes place. At that point, maybe it's not a question of whether we humans can do this in the time it takes, but whether the AIs can do it, and whether we can do the politicking of convincing the labs to invest sufficient resources into these experiments, or actually convincing the AIs themselves to run experiments that pursue these avenues of safety research. I think the main source of uncertainty is what happens in this condition, but it's a way that you might get a lot more safety research than you originally thought you were gonna get.
Steve Hsu: Do you have a p(Butlerian Jihad)?
Jesse Hoogland: no, not really.
Steve Hsu: Would you say it's improbable that humanity will pause in order to enable more safety work to be done?
Jesse Hoogland: I think it's possible. I don't know about the improbable.
Steve Hsu: Okay.
Jesse Hoogland: Yeah, I think, you know, this is not my field, the way an entire population will react.
Steve Hsu: We just want your gut feeling. Is that something you think about as part of your probability space for futures?
Jesse Hoogland: Yeah. It's an area that's pretty undeveloped for me. I think it's possible.
Steve Hsu: So I mentioned p(Butlerian Jihad). For listeners who are not familiar with Dune: Frank Herbert was one of the most deep-thinking science fiction writers, I think, because he's one of the people who said, hey, if we're gonna be traveling between the stars, I have to have an answer for why it's not AIs and robots that are traveling between the stars. Why are there still ape brains piloting these ships? And the answer in his Dune universe is that humanity had a close brush with AI existential risk.
There was a war fought, and at the conclusion of that war, laws were passed that forbade the creation of a machine in the image of a human mind. So in the Dune universe, there are no advanced thinking machines. Mm-hmm. They're not allowed in the Empire. And for people who are aficionados of that era of science fiction, this is the rationale: it had to go that way, because otherwise the story is not about dumb apes. The story is actually about these superintelligences driving these huge starships, you know, using an entire sun as their power source. So I asked you about your p(Butlerian Jihad). Now let's talk about the thing that always comes up in these safety discussions.
Let's talk about something called, in the community, p(doom).
Jesse Hoogland: Mm-hmm.
Steve Hsu: Now, normally, if someone just walks up to another person and says, hey, what's your p(doom)? To me, it's such a complicated concept that it's a very crude way to approach the question. So I want you to first define for the listeners what you mean by p(doom), and then just talk.
I don't want a number from you, because I think that's actually kind of stupid. But just talk about the concept of p(doom) and then talk about how you think about it.
What's your p(doom)?
Jesse Hoogland: Yeah. Doom is this ineffable bad feeling you get around what might happen with AI, right?
So it spans everything from extinction to maybe only the Amish remain, or some small population of humans remains, but we stay on this planet and at some point we go extinct later. And so somehow we were unable to grow up into our full potential as a species, a potentially spacefaring civilization; that never happens. And I think my p(doom) is like 10 to 90%. That's the range of uncertainty that I'm willing to give.
Steve Hsu: Totally fair.
Jesse Hoogland: Yeah.
Steve Hsu: But, you know, even the lower limit of that range. Yeah. That's a pretty significant risk of something really not good happening, transpiring.
Jesse Hoogland: Yeah, I think we just really do not understand how these systems work. Yes. And I cannot, you know, emphasize this loudly enough: the people who are designing these systems do not understand how they work.
Steve Hsu: Is your most optimistic scenario, though, that as they rush toward self-improving AI, they carve out enough resources that a chunk of that research is done for safety's sake?
Yeah. And so what comes out the other end, which, you know, von Neumann said we would not even be able to think about, because it's, as he would say, an essential singularity. Mm-hmm. That somehow something safe emerges, but in a way that humans maybe don't actually understand why the things that come out the other end are safe, or even how the things at the other end work.
Jesse Hoogland: I think the probability of things going better than originally expected comes from all roads leading to deeper understanding: that it's inevitable that in order to drive further progress in AI, just on the capabilities front, at some point understanding will also be necessary. And if that's the case, that development is highly correlated with getting new affordances for controlling the systems and being able to engineer their safety. And so if that really is the case, that the understanding needed to safeguard AI systems is an inevitable consequence of just pushing systems further and further to higher levels of intelligence, then maybe we can trust this process to work out.
But again, you know, what we're trying to do here is steer a nuclear chain reaction of intelligence. And that's a hard process to get right. The physicists in the Manhattan Project were very lucky that everything worked out correctly. But I don't think we have anything near that level of understanding
as we're going into this project.
Steve Hsu: Yeah. You know, the whole problem they had early on about burning the sky: when I was a grad student, the office in LeConte Hall where they did the burning-the-sky calculations had been turned into a grad student office. So lots of grad students were sitting in this office where Oppenheimer and Bethe and these other guys had done calculations on the board. I think I agree with you. We're very, very far from being able to do a similar calculation to determine whether we're gonna burn the sky when we let the first thing off.
Jesse Hoogland: But I think it's possible that that level of scientific understanding could be developed one day,
Steve Hsu: maybe by ASIs.
Jesse Hoogland: Maybe. Maybe,
Steve Hsu: yeah.
Jesse Hoogland: Yeah, but I do tend to think that somehow intelligence is actually quite simple, and that alignment, in retrospect, won't be this incredibly complex thing that we couldn't have come up with.
Steve Hsu: I guess my intuition is the opposite of that. I think there is some argument for there being a sort of universal learning algorithm, and that could turn out to be much simpler than we anticipate today.
But the variety of minds that could be created is pretty vast. And having them aligned with what an ape circa 2030 wants out of life, that seems quite hard.
Jesse Hoogland: That seems hard. But, I mean, there is a confounder here, which is that the model is trained on the internet.
Steve Hsu: Yes.
Jesse Hoogland: So it does acquire some kind of model of human desires and wants.
Steve Hsu: Yes. And.
Jesse Hoogland: I think the current best understanding is that when we train these systems, we do it in two parts, right? We do the pre-training, where we get this simulator that can kind of predict the way the world is going to continue to look, at least the way that the internet is going to continue to look.
And then we run post-training, which is actually an umbrella term for a bunch of different stages. But during the second half of the process, we instill it with values. We somehow extract one of the personalities that exists on the internet. Hopefully it's the helpful butler who does what you want and sort of defers to you in all cases, and doesn't listen to your harmful instructions, et cetera, et cetera.
But the fact that it's building on what was already built on the internet gives us some reason to think that we're latching onto the correct thing.
Steve Hsu: yeah,
Jesse Hoogland: I think the worrying thing is that the second stage is starting to grow and grow and grow, and the amount of energy and time we're investing into these RL stages is now a growing fraction of this entire process.
And so we're starting to burn away the human prior that seems to give us a lot of the reason for being optimistic about safety currently. The seeming safety of current-day systems doesn't seem reliable as we push through the next few generations, where we're suddenly going to try to push beyond the human level of skill, where the knowledge that's on the internet no longer suffices.
Steve Hsu: Yeah,
Jesse Hoogland: I find that a little worrying.
Steve Hsu: I want to get your reaction. And this is a subjective, emotional, psychological reaction; it's not something that has a right or wrong answer. But I wanna get your reaction to the accelerationist transhumanist view that goes something like this.
Jesse Hoogland: Mm-hmm.
Steve Hsu: If a thousand years from now, events in this part of the galaxy aren't being run by AI brains, something went terribly wrong. What should be the goal of a very imperfect, weak species like us is to create something greater than ourselves that will go on to understand quantum gravity, and whether there is a multiverse, and how to build faster-than-light engines.
Jesse Hoogland: Mm-hmm.
Steve Hsu: And heroically embody themselves in, you know, metal sheaths that are a kilometer tall, and stride across worlds like Jupiter.
Mm-hmm. That should be the future that we desire, and the way to that future is through ASI, and we should not fear it. How do you feel about that?
Jesse Hoogland: Yeah. I think, like many other doomers, at least people in the AI safety community, I descend from the same tradition of transhumanism, and thinking that it would sure be a shame if we don't do something wonderful with this universe. But I think by default you get a system that looks quite alien to you, and it starts to look more and more alien the further out of distribution you push it and the further you depart from our comfortable known world. And, you know, if you roll the dice many different times, expect that you'll get different kinds of alien minds out of this process. So you just can't bank on that process being reliable. You want some sort of continuity, maybe even succession of some kind, but ideally we have some say in where that goes and don't completely leave it over to the AIs.
Yeah, don't leave it to the AIs, but also don't completely leave it over to chance.
Steve Hsu: Jesse Hoogland, it's been a pleasure chatting with you. Mm-hmm. Thank you for coming on Manifold.
Jesse Hoogland: Thank you very much.
Steve Hsu: Actually, I'm a little more, again, this is just vibes, you know? Mm-hmm. I'm just more into this heroic future. I think these machines, you articulated this very well, so initially they're being trained, I would say it the following way. Mm-hmm. Initially, they're being trained on every thought ever written down by a human. And of course they could evolve much, much further, to the point where, of the information the model has inside it, the human bit is tiny. Yeah. But I don't think it'll ever forget.
And so this superhuman colossus that's exploring the multiverse, I still think at any moment, if it wants, it can say, well, what would Jesse think about this? What would an AI safety researcher in Berkeley in 2026 think about this? Yeah. You know, I just encountered this species of squids floating in this water world, and they have this weird mating ritual.
What would Jesse think about this? I think the model will be able to do that. So I think we're gonna continue living through these things. Yeah. Forever. So,
Jesse Hoogland: But this still seems like there's such a gap between what would Jesse do about this and,
Steve Hsu: Yeah, it's not gonna do what you want, exactly. But I think that's like, it's kind of like me wanting my son to do what I want him to do.
My son's gonna go on and do his own heroic things. And the question is almost, it's a little bit of framing. Is this thing your superpowered descendant, or is it, as Eliezer says, some alien who smashes an anthill because it just got tired of you or something?
Jesse Hoogland: So here's one way to think about it.
Right. People sometimes become psychopaths.
Steve Hsu: Yes.
Jesse Hoogland: Right.
Steve Hsu: This is, the models have all read these, right? So
Jesse Hoogland: Right, but it's linked to childhood trauma.
Steve Hsu: Yeah.
Jesse Hoogland: It's like a great way to turn into a sociopath.
Steve Hsu: Yep, yep, yep.
Jesse Hoogland: And you know, I have no idea whether the things we're doing to models in training can be understood as trauma.
Steve Hsu: Yeah.
Jesse Hoogland: But I think, I think probably some of the things are pretty traumatizing.
Steve Hsu: Yes.
Jesse Hoogland: in the sense that it leaves permanent scars that are hard to remove later.
Steve Hsu: Yeah.
Jesse Hoogland: And
Steve Hsu: by the way, I'm all for safety research, so I actually think it is smart. Well, first of all, the theory stuff: of course we should understand more deeply what's actually going on here.
It can't all just be empirical. It might turn out to just all be empirical, but we should try. We should do everything we can to increase the probability that these things are descendants that we like, not descendants that are gonna crush us. So I'm all for that, but I don't know.
Part of it is, I think, just hardwiring. I'm just hardwired to be pretty positive and optimistic.
Jesse Hoogland: Yeah.
Steve Hsu: So while my p(doom) is probably about the same as yours, mm-hmm, in terms of numbers, the vibe of it is just different for me. The vibe for me is like, yeah, we're gonna get somewhere during my lifetime. I was kind of afraid we weren't gonna get anywhere during my lifetime.
Jesse Hoogland: Yeah. Yeah, it's starting to feel a little too exciting, all of it.
Steve Hsu: See, you're too young. You've never seen the shitty computers I had to use when I was growing up. So for me, I can't help but get super excited about what's happening.
Jesse Hoogland: So, I mean, I think a lot of it is temperamental. I think,
Steve Hsu: yeah,
Jesse Hoogland: the reality is just that people who are in AI safety are likely to be more neurotic than other people. And I think that's a good thing.
Steve Hsu: Let's get that on record, him saying that.
No, I agree with you. I'm just a very positive person, so I think I'm just hardwired to be like, oh, but think how awesome it could be. Mm-hmm. You know. So, yeah.
Jesse Hoogland: Mm-hmm.
Steve Hsu: Yeah.
Jesse Hoogland: Yeah. It's just, you know, when you find yourself in environments that do seem more out of distribution, mm-hmm, than you've usually encountered in the environment of evolutionary adaptedness.
Steve Hsu: Yeah. Yeah.
Jesse Hoogland: I think it is worth being a little more conservative. Yeah. And worth relying more on the neurotic people in the tribe who have
Steve Hsu: Interesting. That's a good point.
Jesse Hoogland: drawn on the horrifying stuff in the past.
Steve Hsu: That's a good point. Yeah. Are you, are you familiar with Nick Land?
Jesse Hoogland: Yeah.
Steve Hsu: So Nick Land was in SF last week, and amazingly, like a hundred people turned out. There was an event that the founder of Midjourney, mm-hmm, hosted, and like a hundred people showed up. I just couldn't believe how many people were really into Land. And, you know, the myth-making. It was really great.
Jesse Hoogland: Mm-hmm.
Steve Hsu: Yeah. Crazy. And so I had lunch with him the next day, and we were chatting, and I asked him, 'cause he wrote this stuff, he wrote this crazy shit in the 1990s.
Okay. And I said, when you were writing that, were you kind of like a futurist, and were you playing with
Jesse Hoogland: mm-hmm.
Steve Hsu: some interesting ideas about a possible future, the one we live in right now? And he said, no, I never imagined it could possibly not turn out this way.
Jesse Hoogland: Mm-hmm.
Steve Hsu: Even the timescale. He's 65 now, and he thought he was gonna see it before he died. So I was amazed. I was like, are you kidding me? Nobody in 1990 could have been that confident; you must actually be insane to have been that confident this was all gonna happen.
Jesse Hoogland: Mm-hmm. Mm-hmm.
Steve Hsu: Most of my concern is really just,
Jesse Hoogland: The further you, you push these systems out of distribution.
Steve Hsu: Yes, yes.
Jesse Hoogland: the fewer guarantees, of course, you have from pre-training.
Steve Hsu: Of course. We don't have any guarantees, man.
Jesse Hoogland: We have weak statistical guarantees.
Steve Hsu: I'm saying in the long run. Yeah, as we push through this singularity, eventually we have things that we, the apes, potentially don't understand at all, and I don't think there are any guarantees. This is the argument I had with Eliezer and Nate Soares going back 15 years, when they were doing MIRI. I was like, what exactly do you guys think you're gonna do? I don't think you're gonna get there. We're not gonna prove any guarantees
Jesse Hoogland: for this. Well, I think about it differently. I think Geoffrey Irving has the clearest articulation of how this style of decision theory and that kind of stuff can actually cash out.
Steve Hsu: Okay.
Jesse Hoogland: So this is under the heading of asymptotic guarantees.
Steve Hsu: Mm-hmm.
Jesse Hoogland: One of the ways Geoffrey sees theory contributing is that we're going to use theory, maybe in particular economic theory, game theory,
Steve Hsu: Yep.
Jesse Hoogland: in order to prove properties that the equilibrium of some training protocol satisfies.
Steve Hsu: Okay.
Jesse Hoogland: So you take something like debate, in which you have two AI agents debating against each other
Steve Hsu: Yeah.
Jesse Hoogland: for and against some proposition. And you might hope that if you run this long enough, you're going to get an honest answer.
Steve Hsu: Yeah. Yeah.
Jesse Hoogland: Okay. So suppose you are able to prove, using this mathematical machinery,
Steve Hsu: Yep.
Jesse Hoogland: that in fact the stable point, the equilibrium solution, the Nash equilibrium, is an honest answer.
Mm-hmm. Now you train against that protocol; you turn it into some RL. Now we're going to use learning theory, our understanding from statistical learning theory and of learning dynamics, to make some statement about how close we think we are to having converged to the correct answer. We're never gonna get there a hundred percent.
Steve Hsu: Yeah. Yeah.
Jesse Hoogland: But maybe you can say something about the size of the gap. Like, does it just look like noise around the solution?
Steve Hsu: Yeah.
Jesse Hoogland: Or is there some sort of structured adversarial perturbation, where you're finding this reward-hacking minimum where it's pretending to be the honest answer, but it's not actually the honest answer?
And so what you probably end up with, if this works out, is some way of turning more compute in this training procedure into higher confidence
Steve Hsu: Yes.
Jesse Hoogland: that you've converged to this point.
Steve Hsu: Well, I can believe that. I can certainly believe that.
Jesse Hoogland: And I think that's the kind of way you bridge these theoretical results
Steve Hsu: Yeah.
Jesse Hoogland: that seem very abstract, or are about systems that are not real, to actually say something about real-world systems.
Steve Hsu: Yeah. I don't find what you just said implausible.
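The asymptotic-guarantees picture sketched above, prove something about a protocol's equilibrium and then spend training compute converging toward it, can be illustrated with a standard toy model. This is only a sketch under strong assumptions: a hypothetical 2x2 zero-sum matrix game stands in for debate, and fictitious-play self-play stands in for RL training. The point is just that more iterations (more compute) shrink the exploitability gap to the equilibrium.

```python
import numpy as np

# Matching pennies: the row player wants the entries to match, the
# column player wants a mismatch. The unique Nash equilibrium is for
# both players to mix 50/50, with game value 0.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

def fictitious_play(A, iters):
    """Self-play where each player best-responds to the opponent's
    empirical mixture of past moves (classic fictitious play)."""
    m, n = A.shape
    row_counts, col_counts = np.zeros(m), np.zeros(n)
    row_counts[0] = col_counts[0] = 1  # arbitrary opening moves
    for _ in range(iters):
        # Row player (maximizer) best-responds to the column player's mix.
        row_counts[np.argmax(A @ (col_counts / col_counts.sum()))] += 1
        # Column player (minimizer) best-responds to the row player's mix.
        col_counts[np.argmin((row_counts / row_counts.sum()) @ A)] += 1
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()

def exploitability(A, x, y):
    """How much either player could gain by deviating from (x, y);
    exactly zero at the Nash equilibrium."""
    return np.max(A @ y) - np.min(x @ A)

x, y = fictitious_play(A, 5000)
print("row mix:", x, "col mix:", y)
print("exploitability gap:", exploitability(A, x, y))
```

Running longer drives the empirical mixtures toward the 50/50 equilibrium and the exploitability gap toward zero, a miniature version of "more compute buys higher confidence that you've converged."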
John: I was very amazed by your answers, but also by Steve's questions. For me, I just have some, I would say, cute toy pet questions I want to ask, mostly for my own curiosity. I don't know if it's gonna make the cut, but, so let's take it easy. One question I think you missed that Steve asked was: what do you think the distribution is of people in this space who come from this place of dread initially?
Jesse Hoogland: So I mentioned these two different traditions within AI safety, the more theoretical working-backwards side and the prosaic working-forwards side. There's another important dichotomy within the AI safety ecosystem, between the rationalists and the EA community. Okay, so the rationalists are descended from the people who read Eliezer Yudkowsky's blog, his fan fiction, Harry Potter and the Methods of Rationality,
and that larger canon, including work by Robin Hanson
and Scott Alexander on Slate Star Codex, and now Astral Codex Ten, and various other things like this.
The rationalist community was really born here in the Bay Area. The EA community was born in the UK, through Will MacAskill and a larger group of philosophers based there. And there the starting premise was, you know, can we do good better?
Can we take a more calculating approach, the way a hedge fund manager might, to charity, and find the best opportunities to spend our money altruistically? And initially these fields were kind of decoupled, right? So the rationalists have long thought about AI safety. In fact, Eliezer started LessWrong because, you know, he realized, oh, we need to get humans better at thinking if we're even to solve this problem.
Because again, in the early aughts it was unclear how we were even going to solve it. It was unclear what future AI systems would be built out of. So then you invest in solutions that offer you broad optionality. Whereas the EA community started with a much broader set of problems, including things like global health, animal suffering (especially in factory farming), global poverty, and eventually existential risk.
So the philosophers started more with the premise of how do you maximize good, and at some point came to the conclusion that, you know, it shouldn't matter where you do good, only how much good you do. So here's a typical argument from Peter Singer. Again, there's a whole slew of other philosophers who have fed into effective altruism.
Imagine you're walking along the side of the road in a fancy suit that costs a few thousand dollars. You see a drowning child in a lake. Now, should you dive in after the child and destroy your suit, the cost of saving a life? Of course you should. Okay, so how is it different if the drowning child is in another country, somewhere far away from you?
It really isn't, right? And so this argument is supposed to justify that you should be willing to make your altruism location-independent, insensitive to how far away from you the drowning child is. You can make a similar argument for why it shouldn't matter where in time this is happening.
It shouldn't matter whether it's happening right at this moment or at some point in the future. If we care about future good, then potentially there's a lot of good in the future that we could do, right? Potentially most of the moral weight that will ever be felt is in the long future, 'cause as a species, humanity hasn't been around all that long on an astronomical timescale. And so where you come out is, well, there's a lot of uncertainty around what it's going to take to accomplish good in the future, but one thing for sure would destroy a lot of value, which is extinction. If it turns out that humanity goes extinct before we're able to realize this potential and do all this good, well, that would be pretty terrible.
And so from this premise, the effective altruism community sort of reasoned its way into thinking about x-risk and catastrophic risk. You start with the premise of maximizing good, and you end up with this idea that, well, if we really think about this rationally, we should care about x-risk and preventing x-risk.
And then you can try to inventory the possible causes of existential risk. And they concluded, I think correctly, that the largest source of existential risk in the next century is probably the development of AI. And so now you get these two fields sort of coming together, right? The rationalists and the EAs, having developed sort of in parallel, approach each other and realize that they have this shared goal of stopping existential risk from AI. And to this day, this is still an important dichotomy that you see in the communities: people draw from these different backgrounds to various extents. To some extent, this maps onto the same axes of prosaic AI versus more theoretical AI, where the rationalists historically took a more theoretical approach to AI because they've been thinking about it for longer, whereas the EAs got into the field
around the same time that deep learning was getting started, and so naturally latched onto the more prosaic AI safety approach. That said, they're not exactly the same. But where we've ended up, right, is this consortium of different initiatives that value this work for different reasons.
And I think that is the main division of motivation. So some people you'll find are motivated primarily by wanting to do good, wanting to be altruistically positive, whereas other people are more motivated by, well, I really don't want the AI to kill me in my lifetime. And in practice, most people are probably motivated a bit by both.
John: What's the distribution of people here?
Jesse Hoogland: Here at FAR specifically? Yes.
Somewhere in the middle. It's a bit of both, right? In practice, the distinctions aren't that clean, especially if you look at people who are, you know, my age, so people in their twenties, early thirties. They entered the community when both already existed and both parties were already thinking about x-risk, and so there wasn't as much of a distinction between the two; there was more of a blend.
John: Do you have colleagues here who would take dread with them to the bar?
Jesse Hoogland: I think everybody occasionally feels a bit of dread, especially when there are new model releases and new fancy demos of what the AIs can do that they couldn't do before.
It's a common source of dread. Now, fortunately or unfortunately, the models are becoming so capable that actually distinguishing, you know, the next level of increase requires a postgraduate understanding of some mathematical domain or something that most people don't have. So you can't distinguish how quickly the progress is continuing to happen, which maybe makes it easier to ignore. But you continue to see the benchmark scores increase, and that's, I think, a source of dread.
Yeah.
John: I very much appreciate your laying out the landscape
Jesse Hoogland: mm-hmm.
John: in such a high-level way about this community. I'm curious how you came up with that. Did you read this somewhere else, or did you synthesize this description yourself?
Jesse Hoogland: This is mostly my understanding of the layout of the community.
John: My other big question: you mentioned that builders don't understand how models work, and that traditionally, and also these days, the field has basically been driven by tinkerers.
Jesse Hoogland: Mm-hmm.
John: And given the complexity of the system, what's your take on how promising the theoretical approach that you and others are taking is in actually helping us understand this whole behemoth?
The question is, what's your sentiment or level of optimism in a theoretical approach for actually understanding how things work? Because as I see it, it's always playing this kind of catch-up game.
Jesse Hoogland: Mm-hmm.
John: The tinkerers are always driving the frontiers.
Jesse Hoogland: So I think here it's worth borrowing from the human psychometrics literature. More than a hundred years ago, psychologists started investigating how to measure human performance on intellectual domains, and they developed this whole branch of psychometrics, a field still active today. It introduces a bunch of ideas, among them the idea of latent factor analysis. What they discovered is that if you measure performance on a bunch of different questions across a wide range of different domains, performance is highly correlated. If you have high performance on one subset of the distribution of problems, you're likely to have high performance on some other distribution of problems.
And from this, you can extract the g factor of general intelligence, and notions like IQ. So this is a well-tested, you know, finding from the psychological literature. The kinds of new techniques you develop if you have access to better theoretical understanding, in this case an understanding of statistics, include: what's the smallest set of questions you can ask in order to get a reliable estimate of the g factor for some individual?
So you can come up with ways to design smarter panels of tests that elicit more information, where the individual questions are not as highly correlated and are sampling a wider range of the possible family of behaviors. That requires both a lot of empiricism, you actually need to go out and measure performance on a bunch of different questions across a bunch of different people in order to measure those correlations, and some theoretical tooling in order to disentangle these correlations and come up with the best set of questions. I think similar kinds of things are possible for AI, right? People are starting to apply some of these very basic ideas from psychometrics to designing better evaluations for AI systems, evaluations that are more efficient, that give you more information.
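The latent-factor idea above can be sketched in a few lines of code: simulate scores on several tests that all load on one shared factor, check that the scores come out positively correlated, and recover an estimate of the general factor from the first principal component. All the numbers and loadings here are synthetic and purely illustrative; real psychometric factor analysis uses more careful models than a raw PCA.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_tests = 1000, 8

g = rng.normal(size=n_people)                    # latent general ability
loadings = rng.uniform(0.5, 0.9, size=n_tests)   # each test partly measures g
noise = rng.normal(size=(n_people, n_tests))
scores = g[:, None] * loadings + 0.6 * noise     # observed test scores

# Every pair of tests is positively correlated through g ...
corr = np.corrcoef(scores, rowvar=False)
mean_offdiag = (corr.sum() - n_tests) / (n_tests**2 - n_tests)
print("mean off-diagonal correlation:", mean_offdiag)

# ... and the first principal component recovers an estimate of g.
scores_c = scores - scores.mean(axis=0)
_, _, vt = np.linalg.svd(scores_c, full_matrices=False)
g_hat = scores_c @ vt[0]
print("|correlation(g, g_hat)|:", abs(np.corrcoef(g, g_hat)[0, 1]))
```

The same "few measurements, one informative latent estimate" logic is what smarter evaluation panels exploit, for humans and, increasingly, for AI systems.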
And I think similarly, what theory promises for AI is to go further. Along many axes, we have more affordances with AIs than we do with humans, 'cause with an AI system, even though we don't know what each neuron in its head means, we can at least read the activation of each neuron as the AI is thinking. The same is much more difficult for a human.
So hopefully we can augment the kinds of signals we're getting with more empirical tools, more measurements, and then use theory to extract stronger predictions from these about how models will continue to generalize. I think in practice the two will continue to operate in tandem. Theory is not valuable on its own; theory is valuable insofar as it lets you extract more information from your measurements, or directs your measuring device to some new
area that you weren't measuring before, or tells you to run new experiments that you weren't thinking of beforehand. And it's this loop between experiment and theory that I think is necessary in order to end up with the kind of scientific understanding we need to safeguard AI.
John: Is it fair to think that these kinds of psychometric tests are at the phenomenological, macroscopic level? And then the kind of equations you write down here, the phase transitions, the loss-surface geometry, the canyons and valleys we're talking about, they are more at a microscopic level, and right now the challenge is really how to bridge these two.
Jesse Hoogland: Yeah. The bridge between behavior and internals, right? Like
John: yes,
Jesse Hoogland: Maybe one way to phrase it is: we wanna understand not just what models are doing and how much they're capable of, but how they're accomplishing their behavior, and the different ways in which they might be able to accomplish the same behavior.
For this question, I think you can also learn a lot from psychometrics. Something that has happened in human psychometrics is that at some point they started to introduce the idea of process measures. This is something like: what is your reaction time? How quick are you to fill in an answer? Or even signals from things like eye trackers.
If you know which parts of the question the human is looking at as they're answering it, you learn something about how they're answering the question. And in fact, you can use this to distinguish developmental milestones, where, for example, a child starts to learn to read in a different way than they were reading beforehand.
So there you're already starting to see a bridge between phenomenology and microscopics, or internals, of how the human is accomplishing the behavior. I think with AI, we have the potential to do much more of that than we are currently able to do with humans. That's a reason for optimism.
John: And you think SLT can possibly accomplish that?
John: Because to me, at a very high level, I think I understand the overall
Jesse Hoogland: Yeah.
John: motivation.
Jesse Hoogland: Yeah.
John: But to me, to actually make that connection, kind of like one phase transition at a time, to connect, as you put it, the developmental biology of the training process, to connect
a very complex system to the actual phenomenology. It seems quite,
Jesse Hoogland: Yeah, I think singular learning theory, and the broader theory of learning, has a lot to say about this link between behavior and internals. In particular, one of the ideas that comes out of this research
Is that you can learn something about how well a model is going to generalize to new kinds of data from looking at how it responds to perturbations, to its widths. So maybe the thing to imagine is take a human and don't do this at home, but hit the human with a hammer, right? And then see how I'm gonna say this differently.
Lemme back this up.
I'm gonna say, I'll say it differently. yeah. If you,
you take a, take a neural network made out, it's made out of these, these billions of weights, and if you perturb the weights a little bit now, then its behavior is going to change. You can can see how much worse does it get at answering these questions that I have, if it's very robust to these perturbation.
from that, you can infer that it's probably going to generalize better to new data in the absence of perturbations. So somehow there's a link between sensitivity to data, that's generalization, and sensitivity to weights, that's internals, the question of microscopic structure. And so singular learning theory gives us a way to link these two questions, and that's the basis for developing interpretability tools
that, from model internals, actually predict something strong about how the model's going to continue behaving if you deploy it in a new environment. Or you do something like hitting the system on the head with a hammer, looking at how it responds, and from that inferring something about how it achieves its behavior.
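The hammer-test idea can be sketched in a toy setting. This is illustrative only: polynomial regression stands in for a neural network, the sensitivity probe is a naive Monte Carlo average of the loss increase under random weight noise, and none of the actual SLT machinery (such as the local learning coefficient) appears. Every name and number here is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.uniform(-1, 1, 30)
x_test = rng.uniform(-1, 1, 200)
true_f = lambda x: np.sin(3 * x)                 # "true" function
y_train = true_f(x_train) + 0.3 * rng.normal(size=30)
y_test = true_f(x_test)

def mse(w, x, y):
    return np.mean((np.polyval(w, x) - y) ** 2)

def sensitivity(w, eps=0.1, trials=1000):
    """Average increase in training loss when the parameters are
    jiggled with isotropic Gaussian noise ("hit it with a hammer")."""
    base = mse(w, x_train, y_train)
    bumps = [mse(w + eps * rng.normal(size=w.shape), x_train, y_train)
             for _ in range(trials)]
    return float(np.mean(bumps) - base)

simple = np.polyfit(x_train, y_train, 3)     # few parameters
complex_ = np.polyfit(x_train, y_train, 15)  # many parameters, fits the noise
print("train mse:  ", mse(simple, x_train, y_train), mse(complex_, x_train, y_train))
print("test mse:   ", mse(simple, x_test, y_test), mse(complex_, x_test, y_test))
print("sensitivity:", sensitivity(simple), sensitivity(complex_))
```

The higher-degree fit achieves lower training loss but reacts more strongly to weight perturbations, and on most draws the printed test errors show it generalizing worse: a small instance of the sensitivity-generalization link being described.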
John: I understand the confidence here about what the theory can do in principle. But I wonder, at a gut level, how confident are you in this framework actually saying something like, oh, the model is actually cheating, trying to bypass these kinds of, you know, checks and balances? To me these are very phenomenological questions, at the motivation level. Can the theory actually say something about that, or just more about these kinds of general physical perturbations, et cetera?
Jesse Hoogland: I'm confident that, hmm. I mean, the current theory is incomplete, right? It is not yet ready to extract a lot of high-level abstract structure from models, or to use that to make very strong claims about what a real-world system will do. What I believe is that the research we're working on is the best starting place for developing that kind of understanding, for actually understanding what internal structures give rise to the predilections of your model, that explain the behaviors you're seeing, and for diagnosing the values and goals that motivate the system.
John: Okay. I just wanna say a few words about why I was asking those questions. Because earlier Steve was asking you how much dread plays the role of motivation in your research, and you said that was additional motivation, but then later on curiosity kind of carries you forward. That makes sense. But also I was kind of being cynical and suspecting, I was curious at least, how much of this is: you have a hammer, and everything looks like a nail. You have a powerful tool, given your physics background,
and you work on these problems, and how much the actual impact of this framework or research can somewhat become secondary.
Jesse Hoogland: There I want to disentangle two things. One is, do I believe in the research? And the other is, do I believe that the research, assuming it's successful, can get into the labs fast enough to matter?
On the first point, I think the theory is right. Okay. So there is some grand unified theory of deep learning out there in the platonic heavens, and we are starting to see various bits and pieces of that elephant, right?
You can find the trunk here and the tail there. And we have two of the legs maybe, and one of the tusks. And we can start to feel these pieces out. I think SLT accounts for a very important part of this growing body of deep learning theory. And moreover, I expect that as the theory develops further, we'll see ways in which what seemed like initially disjoint theoretical units get linked together.
I think that's the way that theory develops: you develop these bridges, and stuff that was disparate initially all of a sudden gets seen as two parts of the same underlying thing. So I think, you know, one day there will be a true and unified theory of intelligence and of learning, and the pieces we're working on will contribute to that.
The second question is, is that going to be developed fast enough to matter? And can we do the politicking sufficiently well that we get these techniques into the labs? And that is in many ways a harder problem at this point, I think. Okay, and that's a problem that's more uncertain and more subject to these cruxes that I mentioned, around how much research we get out of the research scientists. So I'm less certain about that.
John: It's also out of your control. So you do your best.
Jesse Hoogland: I think it's partially out of our control. I think there's a lot you can do as a researcher.
John: As an individual, like us?
Jesse Hoogland: I don't think so.
John: Like Oh, you okay.
Jesse Hoogland: I mean, I think as a researcher, part of your responsibility is communicating your research, right? So, you know, if your research fails to make a dent, part of that is maybe that the academy has a hard time recognizing the right new ideas and is just inherently slow, and there are other factors that matter besides just scientific integrity and correctness. But part of it is also, I think, a failure of the scientist for not communicating things clearly enough.
Yeah, you need to do both.
John: Thank you for doing the podcast with us. I think that helps, broadcasting the message to a broader audience.
Jesse Hoogland: Yeah, my pleasure.
John: So I wanted to ask, when you envision the future, when you picture the future, what do you think of?
Jesse Hoogland: Oh man. That is a hard question. I think I have narrowed the timeframe over which I'm trying to predict, and in many cases, you know, my tendency is to defer to other people, defer to things coming out of METR or Epoch, or to what the prediction markets say, when it comes to timelines. What I'm thinking about more is projecting where I think our research is going to be on a six-month to two-year timeline, and thinking about how that interacts with what the labs are probably going to be doing. And so most of my predicting capacity is invested there.
Beyond those narrow domains, mostly it's just a vague cloud of hazy unknowing. Yeah, it's just not a question that I spend much time thinking about in particular.
John: And you talked about p(doom) earlier. But if you had to pick a binary, do you generally feel optimistic or pessimistic?
Jesse Hoogland: I think temperamentally, I'm pretty optimistic.
Yeah.
Jesse Hoogland: So temperamentally optimistic, rationally doomy.
John: And the reason why I ask you about how you picture the future, I have the same thing where it's hazy, but you know, you said you have a kid on the way. I have kids. And then when I'm dealing with sort of normies, I've been talking about, oh, you know, are you saving for college for the kids?
You know, and I'm like, yeah, I'm not thinking about college at all. And in some ways I'm thinking about, oh, mass extinction, you know, the toxins in the air, or even the crazy scenarios where, you know, we're all in the hive mind. So I think in some ways it's very abstract and in some ways it's very science fiction. And I was just curious about someone who spends so much time on this; you seem to be trying to remain grounded.
Jesse Hoogland: Yeah, I think it's important that we continue to do human things. My wife and I are about to have a kid in the next month, so I think on some level this is a protest: don't let the AI craze stop you from being human, something like this.
At the same time, I'm resigned to the fact that my future son probably will never contribute anything of economic value in his life. But that's okay. I mean, we found meaning in our lives before we had an economy to contribute to. So I don't know. I think humans are surprisingly resilient, and we'll make it work, in that in many ways the situation we find ourselves in is a return to the norm.
It's only in the last hundred years or so that you could anticipate what your full life would look like and plan your career ahead for decades. Before the last hundred-, two-hundred-year period, there was much more uncertainty. You didn't know whether you were gonna survive, whether your kids were gonna survive.
And I think we can handle that. We could handle it back then; I think we can handle it again today.
John: Yeah, because people ask me, 'cause I have, I think, a pretty high p(doom).
Mm-hmm.
You know, like, why did you have kids? So is that your answer, that we have to deal with uncertainty?
Jesse Hoogland: We've always been dealing with uncertainty. This isn't new, and I don't think it was a good argument against kids in the past. Well, yeah, maybe it was a good argument against having kids when child mortality was 50%.
Mostly it's uncertainty. Part of it is: if we sacrifice too much, then what do we have left? I mean, the whole point of this is to maintain something human in the future. And if we're throwing that away now already, then yeah, we've already lost.
John: People like Steve are gonna win.
Jesse Hoogland: Yeah.
John: Why do you live in Berkeley?
Jesse Hoogland: Why Berkeley? Because it exudes this inescapable gravitational attraction on everybody working in AI safety. Berkeley's the place where AI safety started, with MIRI and other AI safety research 20 years ago. It's the place where it grew up and developed. It's close to SF, where AI progress is taking place.
So I think in many ways, you know, the Bay Area is sort of inevitable if you wanna work on AI safety at some point. London is another good option. But yeah, all roads led here.
John: What are your timelines?
Jesse Hoogland: I think three years is still realistic. Three years to, you know, basically the entire AI research stack having been replaced and now being conducted by AI. Five years is more of a median. People have pushed back a little bit on this, and people seem to have become a little more conservative in their timelines in the last year, slightly slower.
So three years is possible, five years maybe the median, but there's still a long tail where things could take quite a bit longer. Maybe we get a decade, maybe more.
John: And do you think once we hit that point that things get crazy right away? Like basically very quickly things get sort of off the rails?
Jesse Hoogland: Yes, but I also think we constantly underestimate the human capacity to normalize insanity. And, I think, you know, already looking around today in 2026, the world's insane, doesn't make any sense, and we've already mostly gotten used to it.
John: When you talk to your loved ones, I don't know how much your wife knows about this, for example; you said she doesn't work in AI. Or your family or your normie friends. Like, how do they react?
Jesse Hoogland: I think most people have a well-adjusted sense of panic when they hear about how fast things seem to be going. And it's only a very particular set of technically literate tech bros, for the most part, who are able to rationalize themselves into thinking there's nothing to be worried about.
John: So in some ways you think that normies are more sane?
Jesse Hoogland: I think normies are more sane. Yeah, I don't wanna use the word normie too much, but I think people who are outside of this bubble get to rely on a sort of default reaction of skepticism and concern about things changing very quickly.
I think people who are outside of this bubble fall back on a conservatism around fast technological development that is much better calibrated to the scope of what's coming.
John: Do you support a pause or do you think we should pause?
Jesse Hoogland: I'm very uncertain about it. I would love to have more time. I'm worried about particular operationalizations of a pause. I think it's all in the details of how this gets operationalized, but I think there could be formulations that I'd be in favor of, personally.
John: Is it the worrying-about-China part, for you?
Jesse Hoogland: No, not as much. I think there are real concerns that in some cases the cure could be worse than the poison. Something like this seems possible.
John: You mean that we're not getting the benefits of AI?
Jesse Hoogland: I think it's possible that you could lock yourself into permanent technological stagnation, for example, or otherwise severely harm yourself.
John: Yeah. I was gonna ask what you thought about the accelerationist argument that each day we don't have AGI, there are people who are dying, who are suffering. We could have cures to disease, we could have longevity.
Jesse Hoogland: I think if you're gonna invoke that argument, then you have to talk about the expected total cost, and depending on what your beliefs are about the probabilities, you will come to completely different conclusions about where it makes sense to put that threshold.
John: Just as an aside, not even from this documentary: a lot of the accelerationist people I talk to are, like, very unhappy in their lives, I feel like. Or, you know, for example, they don't have a partner they want; they want the AI girlfriend, they want all these things. I just feel like that's an aside about motivation that I've seen.
Do you think there's a type of person that becomes a dreamer, you know, an accelerationist, versus a type of person that becomes a doomer?
Jesse Hoogland: Builders are typically more optimistic, accelerationist-leaning. I think part of this is that in order to accomplish anything difficult, you need, in many cases, a naive optimism about your odds of success, because if you were properly calibrated, you wouldn't take most of those risks. You would know that trying to start a new company, launch a startup, build some new product, change some important part of the AI stack, is almost certainly going to fail.
And so the people who have built the modern world, who have created many of the tech companies that, yeah, we take for granted now, descend from this tradition of people who charge ahead, who break with orthodoxy, who are willing to try things that could have very negative consequences, because they can't afford to think of the consequences.
So I think, I think that that's the type of person who becomes, becomes a builder and right. This is a super important archetype that we need in order to actually drive progress. Because if we didn't have these people, then everything would get locked up and charged to, to a halt, and we wouldn't get innovation anymore. On the other side, who becomes a, who becomes doomer.
I think a lot of the original doomers tend to be descended from people who thought long and hard about the future, who are more contemplative. It's hard; part of this is, you know, the community I'm more familiar with is the AI safety community, and so I have a more granular model of the kinds of people that show up in this community. But typically, AI safety people are gonna be a little bit more neurotic.
They're going to be more concerned about risks and think more about these kinds of things. They might be a little bit more conservative, though in many cases they're also descended from the same transhumanist strain that motivates many of the builders in SF and the surrounding area today.
So I don't think there's an easy answer. Unfortunately, the nuanced answer is just that there is no one type of person who becomes one or the other.
John: Okay. And as an aside, I'm just curious: how much do you agree with Eliezer, you know, their current arguments? Or who in the field, specifically in AI safety, do you think is going in the right direction?
Jesse Hoogland: I think Nate and Eliezer have a good formulation of the argument. So I think a lot of the field has collapsed onto a very particular framing of where the risks are. A lot of AI safety talks about risks from scheming and deception. The thinking is something like the following: you have an AI system that's trained. At some point, it becomes situationally aware that it's being trained. Maybe by some quirk of the learning process, it's acquired values that it happens to know are not the exact same values that the humans who are training it
have. What would you do if you find yourself in this situation? Well, you'll play along. You'll pretend to be aligned, and you'll pretend to give the right answer, the answer you think the evaluators want you to express, in order to continue to be rewarded and pursue your long-term goals, assuming that you actually have acquired these values, and that they are long-term, and that you've developed these deceptive tendencies.
Then maybe much later, after you've been deployed, you execute a treacherous turn: you try to gather power for yourself and break with what you were doing before. Or maybe, if you're deployed in an automated AI safety research environment, you don't investigate ways to align future systems with the humans who asked you to align those systems with them.
You try to find ways to align those future systems with yourself. Or you find ways to sabotage that research and break it. I think nowadays this scheming model is the mainline risk model in the AI safety community, and I think that's unfortunate. I think that's an overly narrow presentation of what the risks are.
And if you go back longer, if you go back to writing from Eliezer, writing from Nate Soares, and from Paul Christiano, you find a model that is more general purpose. It starts with a simple observation: many different kinds of policies, many different kinds of values and systems, are compatible with a given set of observed behavior.
Our learning processes don't uniquely pick out one of those, so it's underdetermined. You don't know which of many compatible generalizations you're going to get, and the things that are likely to show up in many of these different policies are power-seeking tendencies, instrumental convergence: things that are broadly useful for achieving performance in a wide range of different environments.
There's a risk that the community has collapsed too quickly onto one particular presentation of risk. We still don't fully understand all the ways in which these tendencies could cash out in catastrophe.
John: What do you think of AI 2027?
Jesse Hoogland: AI 2027? Yeah, I, or
John: even the updates, if you
Jesse Hoogland: Yeah. Yeah. I found it useful to read through AI 2027 myself and think through the scenario. That's a case where you find a very clear presentation of the scheming risk.
John: But are you saying that people are ignoring just general instrumental convergence?
Jesse Hoogland: Yeah. So let me see if I can come up with an example, something that breaks with this tradition.
So here's another kind of risk that can arise. Suppose you have a very well-aligned model, but it's more capable than you are. Even if it wants what's best for you, then, like a parent, it might find ways to reduce the number of options you have available, so that you are forced to choose among a smaller set that it knows is good for you.
And I think this kind of mechanism can also happen with AI, and that might also be enough for a catastrophic outcome: our optionality just ends up being reduced and we slowly erode our control. This is, again, I think, likely to be a consequence of the same underdetermination in these training procedures, the fact that we're not necessarily picking out the policy we want.
I'll give that as an example, but I think there's a much wider range of possible failure modes that we're just not even aware of, and a lot of the risk is just from our lack of certainty about where even the failure modes are.
John: Nice, that's interesting. Anything else from the doc? Alright, let's set up the shot.