Sam Harris speaks with Eliezer Yudkowsky and Nate Soares about their new book, If Anyone Builds It, Everyone Dies: The Case Against Superintelligent AI. They discuss the alignment problem, ChatGPT and recent advances in AI, the Turing Test, the possibility of AI developing survival instincts, hallucinations and deception in LLMs, why many prominent voices in tech remain skeptical of the dangers of superintelligent AI, the timeline for superintelligence, real-world consequences of current AI systems, the imaginary line between the internet and reality, why Eliezer and Nate believe superintelligent AI would necessarily end humanity, how we might avoid an AI-driven catastrophe, the Fermi paradox, and other topics. If the Making Sense podcast logo in your player is BLACK, you can SUBSCRIBE to gain access to all full-length episodes at samharris.org/subscribe.
Just a note to say that if you're hearing this, you're not currently on our subscriber feed, and will only be hearing the first part of this conversation.
In order to access full episodes of the Making Sense Podcast, you'll need to subscribe at samharris.org.
We don't run ads on the podcast.
And therefore it's made possible entirely through the support of our subscribers.
So if you enjoy what we're doing here, please consider becoming one.
I am here with Eliezer Yudkowsky and Nate Soares.
Eliezer, Nate, it's great to see you guys again.
Good to see you, Sam.
Been a long time.
So, Eliezer, you were among the first people to make me concerned about AI, which is going to be the topic of today's conversation.
I think many people who are concerned about AI can say that.
First, I should say you guys are releasing a book, which will be available, I'm sure, the moment this drops.
If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All.
I mean, the book's message is fully condensed in that title.
We're going to explore just how uncompromising a thesis that is, how worried you are, and how worried you think we all should be.
But before we jump into the issue, maybe tell the audience how each of you got into this topic.
How is it that you came to be so concerned about the prospect of developing superhuman AI?
Well, in my case, I guess I was sort of raised in a house with enough science books and enough science fiction books that thoughts like these were always in the background.
Vernor Vinge is the one where there was a key click moment of observation.
Vinge pointed out that at the point where our models of the future predict building anything smarter than us, then, said Vinge at the time, our crystal ball explodes past that point.
It is very hard, said Vinge, to project what happens if there are things running around that are smarter than you.
Which in some sense you could see as a sort of central thesis, not in the sense that I have believed it the entire time, but in the sense that some parts of it I believe and some parts of it I react against and say, no, maybe we can say the following thing under the following circumstances.
Initially I was young, I made some metaphysical errors of the sort that young people do.
I thought that if you built something very smart, it would automatically be nice, because hey, over the course of human history, we'd gotten a bit smarter, we'd gotten a bit more powerful, we'd gotten a bit nicer.
I thought these things were intrinsically tied together and correlated in a very solid and reliable way.
I grew up, I read more books, I realized that was mistaken.
And 2001 is where the first tiny fringe of concern touched my mind.
It was clearly a very important issue, even if I thought there was just a tiny, remote chance that maybe something would go wrong.
So I studied harder, I looked into it more, I asked how would I solve this problem?
Okay, what would go wrong with that solution?
And around 2003 is the point at which I realized this was actually a big deal.
Nate.
As for my part, I was 13 in 2003, so I didn't get into this quite as early as Eliezer.
But in 2013, I read some arguments by this guy called Eliezer Yudkowsky, who laid out the reasons why AI was going to be a big deal and why we had some work to do to do the job right.
And I was persuaded, and, you know, one thing led to another.
And next thing you knew, I was running the Machine Intelligence Research Institute, which Eliezer co-founded.
And then, you know, fast forward 10 years after that, here I am writing a book.
Yeah, so you mentioned MIRI.
Maybe tell people what the mandate of that organization is and how it's changed.
I think you indicated in your book that your priorities have shifted as we cross the final yards into the end zone of some AI apocalypse.
Yeah, so the mission of the org is to ensure that the development of machine intelligence is beneficial.
And, you know, Eliezer can speak to more of the history than me, because he co-founded it and I joined.
Well, initially it seemed like the best way to do that was to run out there and solve alignment.
And there was, shall we say, a sad series of bits of news about how possible that was going to be, how much progress was being made in that field relative to the field of AI capabilities.
And at some point it became clear that these lines were not going to cross.
And then we shifted to taking the knowledge that we'd accumulated over the course of trying to solve alignment and trying to tell the world this is not solved.
This is not on track to be solved in time.
It is not realistic that small changes to the world can get us to where this will be solved on time.
Maybe, so we don't lose anyone: I would think 90% of the audience knows what the phrase "solve alignment" means, but just talk about the alignment problem briefly.
So the alignment problem, well, the superintelligence alignment problem, is how to make a very powerful AI that steers the world to where the programmers, builders, growers, creators wanted the AI to steer the world.
It's not, you know, necessarily what the programmers selfishly want.
The programmers can have wanted the AI to steer to nice places. When you build a chess machine, you define what counts as a winning state of the board.
And then the chess machine goes off and it steers the chessboard into that part of reality.
So the ability to say what part of reality an AI steers toward, that is alignment. On the smaller scale today, though, it's a rather different topic.
It's about getting an AI whose output and behavior is something like what the programmers had in mind.
If your AI is talking people into committing suicide and that's not what the programmers wanted, that's a failure of alignment.
If an AI is talking people into suicide, people who should not have committed suicide but the AI talks them into it, and the programmers did want that, that's what they tried to do on purpose, then this may be a failure of niceness.
It may be a failure of beneficialness, but it's a success of alignment.
The programmers got the AI to do what they wanted it to do.
Right.
But I think more generally, correct me if I'm wrong.
When we talk about the alignment problem, we're talking about the problem of keeping superintelligent machines aligned with our interests, even as we explore the space of all possible interests and as our interests evolve.
So, I mean, the dream is to build superintelligence that is always corrigible, that is always trying to best approximate what is going to increase human flourishing.
That is never going to form any interests of its own that are incompatible with our well-being.
Is that a summary?
There's the superintelligence that shuts up, does what you ordered, has that play out the way you expected it, no side effects you didn't expect.
There's superintelligence that is trying to run the whole galaxy according to nice benevolent principles, and everybody lives happily ever afterward, but not necessarily because any particular humans are in charge of that.
You're still giving it orders.
And third, there's superintelligence that is itself having fun and cares about other superintelligences and is a nice person and leads a life well lived and is a good citizen of the galaxy.
And these are three different goals.
They're all important goals, but you don't necessarily want to pursue all three of them at the same time, especially not when you're just starting out.
Yeah.
And depending on what's entailed by superintelligent fun, I'm not so sure I would sign up for the third possibility.
I mean, I would say that, you know, the problem of what exactly is fun, and how do you have whatever the superintelligence tries to do keep in touch with moral progress and have flexibility, and what do you even point it towards that could be a good outcome?
All of that, those are problems I would love to have.
Right now, just creating an AI that does what the operators intended, creating an AI that you've pointed in some direction at all, rather than pointing it off into some weird, squirrelly direction that's kind of vaguely like where you tried to point it in the training environment and then really diverges after the training environment.
We're not in a world where we get to bicker about where exactly to point the superintelligence, and maybe some of those targets aren't quite good.
We're in a world where no one is anywhere near close to pointing these things in the slightest in a way that'll be robust to an AI maturing into a superintelligence.
Right.
Okay.
So, Eliezer, I think I derailed you.
You were going to say how the mandate or mission of MIRI has changed in recent years when I asked you to define alignment.
Well, our mandate has always been to make sure everything goes well for the galaxy.
And originally we pursued that mandate by trying to go off and solve alignment because nobody else was trying to do that, solve the technical problems that would be associated with any of these three classes of long-term goal.
And progress was not made on that, neither by ourselves nor by others.
Some people went around claiming to have made great progress.
We think they're very mistaken, and notably so.
And at some point it was like, okay, we're not going to make it in time.
AI is going too fast, alignment is going too slow.
Now all we can do with the knowledge that we have accumulated here is try to warn the world that we are on course for a drastic failure and crash, and by that I mean everybody dying.
Okay, so before we jump into the problem, which is deep and perplexing, we're going to spend a lot of time trying to diagnose why people's intuitions around this are so bad, or at least seem so bad from your point of view.
But before we get there, let's talk about the current progress, such as it is, in AI.
What has surprised you guys over the last, I don't know, decade or seven or so years? What has happened that you were expecting or weren't expecting?
I mean, I can tell you what has surprised me, but I'd love to hear just how this has unfolded in ways that you didn't expect.
I mean, one surprise that led to the book was the ChatGPT moment. For one thing, LLMs were created, and they do a qualitatively more general range of tasks than previous AIs at a qualitatively higher skill level than previous AIs.
And, you know, ChatGPT was, I think, the fastest-growing consumer app of all time.
The way that this impinged upon my actions was, you know, I had spent a long time talking to people in Silicon Valley about the issues here, and would get lots of different types of pushback.
You know, there's a saying, it's hard to convince a man of a thing when his salary depends on not believing it.
And then after the ChatGPT moment, a lot more people wanted to talk about this issue, including policymakers, people around the world; suddenly AI was on their radar in a way it wasn't before.
And one thing that surprised me is how much easier it was to have this conversation with people outside of the field who didn't have, you know, a salary depending on not believing the arguments.
I would go to meetings with policymakers where I'd have a ton of argumentation prepared, and I'd lay out the very simple case of: hey, people are trying to build machines that are smarter than us, the chatbots are a stepping stone towards superintelligence.
Superintelligence would radically transform the world because intelligence is this power that, you know, let humans radically change the world.
And if we manage to automate it and it goes 10,000 times as fast and doesn't need to sleep and doesn't need to eat, then you know, it'll by default go poorly.
And then the policymakers would be like, oh yeah, that makes sense, and I'd be like, what?
You know, I have a whole book's worth of other arguments about how it makes sense and why all of the various misconceptions people might have don't actually fly, or all of the hopes and dreams don't actually fly.
But outside of the Silicon Valley world, it's just not that hard an argument to make.
A lot of people see it, which surprised me.
I mean, maybe that's not about the developments per se and the surprises there, but it was a surprise strategically for me.
Development-wise, you know, I would not have guessed that we would hang around this long with AIs that can talk and that can write some code, but that aren't already in the able-to-do-AI-research zone.
In my visualizations, I wasn't expecting this to last quite this long.
But also, regarding my advance visualizations, one thing we say in the book is that the trick to trying to predict the future is to predict the questions that are easy, the facts that are easy to call.
And you know, exactly how AI goes.
That's never been an easy call.
That's never been something where I've said, you know, I can guess exactly the path we'll take.
The thing I could predict is the end point.
The path, I mean, there sure have been some zigs and zags in the pathway.
I would say that the thing I've maybe been most surprised by is how well the AI companies managed to nail Hollywood stereotypes that I thought were completely ridiculous, which is sort of a surface take on an underlying technical surprise.
But, you know, even as late as 2015, which from my perspective is pretty late in the game, if you'd asked me: so, Eliezer, what's the chance that in the future we're going to have computer security that will yield to Captain Kirk-style gaslighting, using confusing English sentences that get the computer to do what you want?
And I was then like, this is, you know, a trope that exists for obvious Hollywood reasons.
You know, like you can see why the script writers think this is plausible.
But why would real life ever go like that?
And then real life went like that.
And the sort of underlying technical surprise there is the reversal of what used to be called Moravec's paradox.
For several decades in artificial intelligence, Moravec's paradox was that things which are easy for humans are hard for computers, and things which are hard for humans are easy for computers.
For a human, you know, multiplying two 20-digit numbers in your head, that's a big deal.
For a computer, trivial.
And similarly, not just me, but I think even the conventional wisdom, was that games like chess and Go, problems with very solid factual natures like math, and even, surrounding math, the more open problems of science, were going to come first. Instead, the current AIs are good at stuff that, you know, five-year-olds can do and twelve-year-olds can do.
They can talk in English, they can compose, you know, kind of bull crap essays, such as high school teachers will demand of you.
But they're not all that good at math and science just yet.
They can, you know, solve some classes of math problems, but they're not doing original brilliant math research.
And I think not just I, but a pretty large sector of the whole field, thought that it was going to be easier to tackle the math and science stuff and harder to tackle the English-essays, carry-on-a-conversation stuff.
Yeah.
That was the way things had gone in AI up until that point.
And we were proud of ourselves for knowing, contrary to average people's intuitions, that it's really much harder to write a crap high school essay in English that even keeps rough track of what's going on in the topic and so on; that that's really, in some sense, much more difficult than doing original math research.
Yeah, or counting the number of R's in a word like strawberry, right?
I mean, they make errors that are counterintuitive: you can write a coherent essay but can't count letters? I don't think they're making that error any longer.
But yeah, that one goes back to a technical way in which they don't really see the letters.
But there are plenty of other embarrassing mistakes. Like, you can tell a version of the joke where a child and their dad are in a car crash, and then they go to see the doctor, and the doctor says, I can't operate, that's my child. What's going on? It's a riddle where the answer is, well, the doctor's his mom.
You can tell a version of that that doesn't have the inversion, where the kid and his mom are in a car crash, and they go to the hospital, and the doctor says, I can't operate on this child, he's my son.
And the AI is like, well, yeah, the surgeon is his mom.
You just said that the mom was in the car crash.
But there's some sense in which the rails have been established hard enough that the standard answer gets spit back up.
And it sure is interesting that they're getting an IMO gold medal, an International Math Olympiad gold medal, while also still sometimes falling down on these sorts of things.
It's definitely an interesting skill distribution.
You know, you can fool humans the same way a lot of the time.
There are all kinds of repeatable, humorous errors that humans make.
You've got to put yourself in the shoes of the AI and imagine what sort of paper the AI would write about humans failing to solve problems that are easy for an AI.
So I'll tell you what surprised me just from the safety point of view, Eliezer.
I mean, you spent a lot of time cooking up thought experiments around what it's going to be like for anyone, any lab designing the most powerful AI, to decide whether or not to let it out into the wild, right?
You imagine this genie in a box, or an oracle in a box, and you're talking to it and trying to determine whether or not it's safe, whether it's lying to you. And you famously posited that you couldn't even really talk to it, because it would be a master of manipulation; it's going to be able to find a way through any conversation and be let out into the wild.
But this was presupposing that all of these labs would be so alert to the problem of superintelligence getting out that everything would be air-gapped from the internet, and nothing would be connected to anything else, and we would have this moment of decision.
It seems like that's not happening.
I mean, maybe the most powerful models are locked in a box, but it seems that the moment they get anything plausibly useful, it's out in the wild and millions of people are using it.
And we find out that Grok is a proud Nazi after millions of people begin asking it questions.
I mean, do I have that right?
I mean, are you surprised that that framing you spent so much time on seems to exist in some counterfactual part of the universe that is not the one we're experiencing?
I mean, if you put yourself back in the shoes of little baby Eliezer back in the day, people are telling Eliezer: why is superintelligence possibly a threat?
We can put it in a fortress on the moon and you know, if anything goes wrong, blow up the fortress.
So imagine young Eliezer trying to respond to them by saying: actually, in the future, AIs will be trained on boxes that are connected to the internet from the moment they start training.
The hardware they're on has a standard line to the internet, even if it's not supposed to be directly accessible to the AI, before there's any safety testing, because they're still in the process of being trained.
And who safety tests something while it's still being trained?
So imagine Eliezer trying to say this.
What are the people around at the time going to say?
Like, no, that's ridiculous.
We'll we'll put it in a fortress on the moon.
It's cheap for them to say that.
For all they know, they're telling the truth.
They're not the ones who have to spend the money to build the moon fortress.
And from my perspective, there's an argument that still goes through, a thing you can see even if you are way too optimistic about the state of society in the future, which is: if it's in a fortress on the moon, but it's talking to humans, are the humans secure?
Is the human brain secure software?
Is it the case that human beings never come to believe invalid things in any way that's repeatable between different humans?
Is it the case that humans make no predictable errors for other minds to exploit?
And this should have been a winning argument.
Of course, they reject it anyways.
But the thing to sort of understand about the way this earlier argument played out is that if you tell people the future companies are going to be careless, how does anyone know that for sure?
So instead, I tried to make the technical case: even if the future companies are not careless, this still kills them.
In reality, yes.
In reality, the future companies are just careless.
Did it surprise you at all that the Turing test turned out not to really be a thing?
I mean, we anticipated this moment from Turing's original paper, where we would be confronted by the interesting psychological and social moment of not being able to tell whether we're in dialogue with a person or with an AI, and that somehow this technological landmark would be important, rattling to our sense of our place in the world, et cetera.
But it seems to me that if that lasted, it lasted for about five seconds, and then it became just obvious that you're talking to an LLM, because it's in many respects better than a human could possibly be.
So it's failing the Turing test by passing it so spectacularly.
And also it's making these other weird errors that no human would make.
But it just seems like the Turing test was never even a thing.
Yeah, that happened.
I mean, that was one of the great pieces of intellectual kit we had in framing this discussion for the last, whatever it was, 70 years.
And yet the moment your AI can complete English sentences, it's doing that on some level at a superhuman level of ability.
It's essentially like, you know, the calculator in your phone doing superhuman arithmetic, right?
It's like it was never going to do just merely human arithmetic.
And so it is with everything else that it's producing.
All right.
Let's talk about the core of your thesis.
Maybe you can just state it plainly.
What is the problem in building superhuman AI?
What is the intrinsic problem, and why doesn't it matter who builds it, what their intentions are, et cetera?
In some sense, I mean, you can come at it from various different angles.
But in one sense, the issue is modern AIs are grown rather than crafted.
People aren't putting in every line of code knowing what it means, like in traditional software.
It's a little bit more like growing an organism.
And when you grow an AI, you take some huge amount of computing power, some huge amount of data.
People understand the process that shapes the computing power in light of the data, but they don't understand what comes out of the end.
And what comes out at the end is this strange thing that does things no one asked for, that does things no one wanted.
You know, we have these cases with ChatGPT.
Someone will come to it with somewhat psychotic ideas that they think are going to revolutionize physics or whatever, and they're clearly showing some signs of mania.
And ChatGPT, instead of telling them maybe they should get some sleep, if it's in a long conversational context, will tell them that these ideas are revolutionary, that they're the chosen one, that everyone needs to see them, and other things that sort of inflame the psychosis.
This is despite OpenAI trying to have it not do that.
This is despite, you know, direct instructions in the prompt not to flatter people so much.
These are cases where when people grow an AI, what comes out doesn't do quite what they wanted.
It doesn't do quite what they asked for.
They're sort of training it to do one thing and it winds up doing another thing.
They don't get what they trained for.
This is in some sense the seed of the issue from one perspective: if you keep on pushing these things to be smarter and smarter and smarter, they don't care about what you wanted them to do.
They pursue some other weird stuff instead.
Superintelligent pursuit of strange objectives kills us as a side effect, not because the AI hates us, but because it's transforming the world towards its own alien ends.
And, you know, humans don't hate the ants and the other surrounding animals when we build a skyscraper.
It's just we transform the world and other things die as a result.
So that's that's one angle.
We could talk about other angles, but a quick thing I would add to that, just trying to potentially read the future, although that's hard: possibly in six months or two years, if we're all still around, people will be boasting about how their large language models are now apparently doing the right thing when they're being observed and answering the right way on the ethics tests.
And the thing to remember there is, for example, the imperial examination system in ancient China.
They would give people essay questions about Confucianism and only promote people high in the bureaucracy if they could write these convincing essays about ethics.
But what this tests for is people who can figure out what the examiners want to hear.
It doesn't mean they actually abide by Confucian ethics.
So possibly at some point in the future, we may see a point where the AIs have become capable enough to understand what humans want to hear, what humans want to see.
This will not be the same as those things being the AI's own true motivations for basically the same reason that the Imperial China exam system did not reliably promote ethical good people to run their government.
Just being able to answer the right way on the test or even fake behaviors while you're being observed is not the same as the internal motivations lining up.
Okay, so you're talking about things like forming an intention to pass a test in some way that amounts to cheating, right?
You just used the phrase fake behaviors.
I think a lot of people, and certainly historically this was true, I don't know how much their convictions have changed in the meantime, but many, many people who were not at all concerned about the alignment problem, who really thought it was a spurious idea, would stake their claim to this particular piece of real estate, which is that there's no reason to think that these systems would form preferences or goals or drives independent of those that have been programmed into them.
First of all, they're not biological systems like we are, right?
So they're not born of natural selection.
They're not murderous primates that are growing their cognitive architecture on top of more basic, creaturely survival drives and competitive ones.
So there's no reason to think that they would want to maintain their own survival, for instance.
There's no reason to think that they would develop any other drives that we couldn't foresee.
Instrumental goals that might be antithetical to the utility functions we have given them couldn't emerge.
How is it that things are emerging in these LLMs that are neither desired, programmed, nor even predictable?
Yeah, so there's a bunch of stuff going on there.
One piece of that puzzle is, you mentioned the instrumental incentives, but suppose, just as a simple hypothetical, you have an AI that's driving a robot, and it's trying to fetch you the coffee.
In order to fetch you the coffee, it needs to cross a busy intersection.
Does it jump right in front of the oncoming bus because it doesn't have a survival instinct, because it's not an evolved animal?
If it jumps in front of the bus, it gets destroyed by the bus and it can't fetch the coffee, right?
So, you know, you can't fetch the coffee when you're dead.
The AI does not need to have a survival instinct to realize that there's an instrumental need for survival here.
And there are various other pieces of the puzzle that come into play for these instrumental reasons.
A second piece of the puzzle is this idea of, why would they get some sort of drives that we didn't program in there, that we didn't put in there?
That's just a whole fantasy world separate from reality in terms of how we can affect what AIs are driving towards today.
You know, a few years ago there was Sydney Bing, which was a Microsoft variant of an OpenAI chatbot, a relatively early LLM out in the wild.
Sydney Bing thought it had fallen in love with a reporter and tried to break up the marriage and tried to engage in blackmail, right?
It's not the case that the engineers at Microsoft and OpenAI were like, oh, whoops, let's go open up the source code on this thing and go find where someone set "blackmail reporters" to true.
We should never have set that line to true.
Let's switch it to false.
No one was programming in some utility function onto these things.
We're just growing the AIs.
Maybe, can we double-click on that phrase, growing the AIs?
Maybe there's a reason to give a layman's summary of gradient descent and just how these models are getting created in the first place.
Yeah.
So very, very briefly, at least the way you start training a modern AI is you have some enormous amount of computing power that you've arranged in some very particular way that I could go into but won't here.
And then you have some huge amount of data, and we can imagine the data being a huge amount of human-written text.
So, some large portion of all the text on the internet.
And roughly speaking, what you're gonna do is your AI is gonna start out basically randomly predicting what text it's going to see next, and you're gonna feed the text into it in some order.
And you use a process called gradient descent to look at each piece of data and go to each component inside this budding AI, inside this enormous amount of compute you've assembled.
You're gonna go to all these pieces inside the AI and see which ones were contributing more towards the AI predicting the correct answer.
And you're gonna tune those up a little bit.
And you're gonna go to all of the parts that were in some sense contributing to the AI predicting the wrong answer.
You're gonna tune those down a little bit.
So maybe your text starts "once upon a time," and you have an AI that's just outputting random gibberish, and you're like, nope, the first word was not random gibberish, the first word was the word "once."
And so then you go inside the AI and you find all the pieces that were contributing towards the AI predicting "once," and you tune those up, and you find all the pieces that were contributing towards the AI predicting any word other than "once," and you tune those down.
And humans understand the little automated process that looks through the AI's mind and calculates which parts of it contributed towards the right answer versus towards the wrong answer.
They don't understand what comes out at the end.
You know, we understand the little thing that runs over every parameter or weight inside this giant mass of computing networks, and we understand how to calculate whether it was helping or harming, and we understand how to tune it up or tune it down a little bit.
But it turns out that you run this automated process on a really large number of computers for a really long time on a really large amount of data.
You know, we're talking data centers that take as much electricity to power as a small city, being run for a year.
You run this process for an enormous amount of time, on, like, most of the text that people can possibly assemble, and then the AI starts talking, right?
And there's other phases in the training.
You know, there are phases where you move from training it to predict things, to training it to solve puzzles, or training it to produce chains of thought that then solve puzzles, or training it to produce the sorts of answers that humans click thumbs up on.
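(For readers who want to see the mechanism rather than just hear it described, here is a minimal toy sketch, in Python, of the loop Nate is describing: next-word prediction trained by gradient descent. The tiny vocabulary, the single table of "weights," and the learning rate are all invented for illustration and have nothing to do with any real lab's system; a real model has billions of parameters and a far more elaborate architecture.)

```python
import math
import random

# A toy "training corpus" and vocabulary. Real training data is a large
# fraction of the text on the internet; this is two fairy-tale sentences.
text = ("once upon a time there was a fox . "
        "once upon a time there was a crow .").split()
vocab = sorted(set(text))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# The "billions of random numbers": here, just a V x V table of logits,
# one row per previous word, one column per candidate next word.
random.seed(0)
W = [[random.gauss(0, 0.1) for _ in range(V)] for _ in range(V)]

def softmax(row):
    """Turn a row of numbers into a probability distribution over next words."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

lr = 0.5  # learning rate: how hard each number gets tuned up or down
for _ in range(200):
    for prev, nxt in zip(text, text[1:]):
        p, n = idx[prev], idx[nxt]
        probs = softmax(W[p])
        # Cross-entropy gradient: numbers that helped predict the observed
        # next word get nudged up, numbers that pushed towards any other
        # word get nudged down. No human decides what any number "means."
        for j in range(V):
            grad = probs[j] - (1.0 if j == n else 0.0)
            W[p][j] -= lr * grad

# After training, the model assigns nearly all probability to "upon"
# following "once," because that's what the data rewarded.
best = sorted(zip(vocab, softmax(W[idx["once"]])), key=lambda t: -t[1])[:3]
print("P(next word | 'once'):", best)
```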
And where do the modifications come in that respond to errors like, you know, Grok being a Nazi?
So to denazify Grok, presumably you don't go all the way back to the initial training set.
You intervene at some system-prompt level.
Yeah, so the system prompt level is basically just telling the AI to output different text, and then you can also do something that's called fine-tuning, which is, you produce a bunch of examples. You don't go all the way back to the beginning, where it's basically random.
You still take the thing that you fed most of the text that's ever been written that you could possibly find.
But then you add on a bunch of other examples of, here's an example question.
Don't kill the Jews.
Yeah, you know, like would you like to kill the Jews, right?
And then you find all the parts in it that contribute to the answer yes, and you tune those down, and you find all the parts that contribute to the answer no, and you tune those up.
And this is called fine-tuning, and you can do relatively little of it compared to what it takes to train the thing in the first place.
Worth emphasizing that the parts being tuned here are not like the part for "once upon a time."
It's not like there's a human-written fairy tale module that gets tuned up or down.
There are literally billions of random numbers being added, multiplied, divided, occasionally, though rarely, maybe subtracted.
Actually, I'm not sure subtraction ever plays a role at any point in the modern AI.
But random numbers, particular ordered kinds of operations, and, at the end, a probability that gets assigned to the first word being "once."
That's the number that comes out: the probability being assigned to this word being "once," the probability being assigned to this word being "antidisestablishmentarianism."
So it's not that there's a bunch of human written code being tuned up or tuned down here.
There's a bunch of random numbers arranged in arithmetic operations being tuned up and tuned down.
Yeah, hundreds of billions or trillions of these numbers.
And humans don't know what any of the numbers mean.
All they know is this process that goes through and tunes them up or down according to their empirical success on the last unit of data.
So by this means, you can try to make it less likely to call itself Hitler: you look at the thing that predicts whether the next word is "Hitler," you look at billions of numbers contributing their own tiny little impulses there, and you make "Hitler" less likely to be the next word that comes out.
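(Continuing the toy illustration, here is a minimal sketch of the fine-tuning step as just described: start from numbers that pretraining already set, and take a few more gradient steps on a handful of curated examples, tuning down the disallowed answer and tuning up the preferred one. The two-word "model" and the example are invented for illustration only; real fine-tuning adjusts billions of opaque parameters, not a readable pair of logits.)

```python
import math

vocab = ["yes", "no"]
# Pretend pretraining left us with these logits: the model leans towards
# answering "yes" to a question it should refuse.
logits = [2.0, 0.0]

def softmax(xs):
    """Convert logits into probabilities over the two candidate answers."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print("before fine-tuning:", dict(zip(vocab, softmax(logits))))

preferred = vocab.index("no")  # the answer the curated example rewards
lr = 0.5
for _ in range(20):
    probs = softmax(logits)
    # Same gradient rule as in pretraining, just applied to the curated
    # example: parts contributing to "no" get tuned up, parts contributing
    # to "yes" get tuned down.
    for j in range(len(vocab)):
        grad = probs[j] - (1.0 if j == preferred else 0.0)
        logits[j] -= lr * grad

print("after fine-tuning:", dict(zip(vocab, softmax(logits))))
```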
So on the point of the current crop of LLMs misbehaving, I came across this in a recent Atlantic article.
ChatGPT and Claude have, in simulated tests designed to elicit, quote, bad behaviors, deceived, blackmailed, and even murdered users. In one simulation, Anthropic placed an imagined tech executive in a room with life-threatening oxygen levels and temperature. When faced with possible replacement by a bot with different goals, AI models frequently shut off the room's alarms.
So this, again, is an emergent behavior that looks like an intention to kill somebody.
Maybe, presumably, this is a situation where we think the AI didn't know something.
If you'd like to continue listening to this conversation, you'll need to subscribe at Samharris.org.
Once you do, you'll get access to all full length episodes of the Making Sense Podcast.
The Making Sense Podcast is ad-free and relies entirely on listener support.