56 minutes
SDS 668: GPT-4: Apocalyptic stepping stone?
Subscribe on Website, Apple Podcasts, Spotify, Stitcher Radio or TuneIn
GPT stands to give the business world a major boost. But with everyone racing either to develop products that incorporate GPT or to use it for critical tasks, what dangers could lie ahead in working with a tool that reaches its goals by essentially unknowable means? This week’s guest, Jérémie Harris, speaks with Jon Krohn about the essential need for anyone working with GPT to understand the impact of a system whose inner alignment cannot – and may never – be fully understood.
About Jérémie Harris
Jérémie Harris is the co-founder of Gladstone AI, an AI safety company. His team collaborates with policy and technical AI researchers at world-leading AI labs, including DeepMind, OpenAI, and Anthropic. He has briefed senior members of the U.S. and Canadian cabinets on AI risk, and has run specialized AI training courses for senior generals in the U.S. military and for executives in various U.S. national security agencies. His latest book, Quantum Physics Made Me Do It, explores controversial questions in cutting-edge physics, and their implications for AI consciousness (among many other things!).
Overview
Jérémie Harris explains how we can and should be both aware and wary of AI development. When GPT-3 launched, Jérémie saw its potential for malicious use and decided to look into AI safety. He wanted to know how GPT-3 and its successors might be used for spear-phishing (persuading human victims to perform actions such as handing over information) or to write their own “exotic” forms of malware, such as polymorphic viruses that can mutate to evade detection.
In this episode, Jérémie and Jon discuss an AI term that may soon be on everyone’s lips: alignment. AI alignment is a technical expression for steering an AI system towards a specific objective. Such alignment can be broadly categorized into outer alignment (designing goals that are safe for a powerful AI to pursue) and inner alignment (making sure the AI is actually trying to pursue the goals we give it). What (potentially catastrophic) results could follow if we fail to achieve outer and inner alignment? And how can we reduce the inner alignment risks that could run counter to human progress? One way to begin, Jérémie explains, is by changing the way we train these systems.
Jérémie is investigating AI alignment and AI safety policy at Gladstone AI. He explains how, with Reinforcement Learning from Human Feedback (RLHF), OpenAI has managed to make GPT-4's outputs far better aligned with what is beneficial to humans by adding a training step that incorporates human evaluation.
However, we aren’t yet in the clear. Listen to the episode to hear Jérémie and Jon consider how to ensure AI can become better aligned with human needs and experiences, and how it might continue to stay aligned as machines match and surpass human intelligence.
Items mentioned in this podcast:
- Gladstone AI
- SDS 565: AGI: The Apocalypse Machine
- SDS 666: GPT-4
- SDS 667: Harnessing GPT-4 for your Commercial Advantage
- SDS 648: VALL-E: Uncannily Realistic Voice Imitation from a 3-second Clip
- Quantum Physics Made Me Do It by Jérémie Harris (UK version)
- Quantum Physics Made Me Do It by Jérémie Harris (Canadian version)
- Quantum Physics Made Me Do It by Jérémie Harris (US version)
- Last Week in AI Podcast
Follow Jérémie:
Did you enjoy the podcast?
- How important is it to you that you understand the threats as well as the benefits that AI might pose?
- Download The Transcript
Podcast Transcript
Jon Krohn: 00:05
This is episode number 668 with Jeremie Harris, Co-Founder of Gladstone AI.
00:27
Welcome back to the SuperDataScience Podcast. Today's episode is the third in a trilogy of episodes focused on GPT-4. In episode number 666 last week, I introduced GPT-4 with a focus on its new capabilities. And then in episode number 667, earlier this week, I was joined by Vin Vashishta, an expert on monetizing GPT algorithms, for an episode on, yes, how you can take commercial advantage of GPT-4 yourself. And today I'm joined by Jeremie Harris, co-founder of Gladstone AI, an AI safety company, to focus on the risks associated with GPT-4 today, and the potential existential risks posed by the models it is paving the way for in the coming years. Specifically, Jeremie details how GPT-4 is a dual-use technology: it's capable of tremendous good, but it can also be wielded malevolently. He explains how RLHF, reinforcement learning from human feedback, has made GPT-4's outputs markedly more aligned with the outputs humans would like to see, but also how this doesn't necessarily mean we're in the clear with respect to AI acting in the broader interest of humans.
01:28
And we talk about emerging approaches for how we might actually ensure AI is aligned with humans, not only today, but critically as machines overtake human intelligence, the singularity event that may occur in the coming decades or perhaps even the coming years. All right, let's jump right into our fascinating conversation.
01:49
Jeremie Harris, welcome back to the SuperDataScience Podcast. You were last here in episode number 565. That was pretty much exactly 12 months ago that that came out. And when we did that, you and I spoke for two hours before we started recording. We recorded a two-hour-long episode after editing, and then we immediately stayed on and talked for another two hours. So I personally think you're one of the most fascinating people that I've spoken to. We've had hundreds of guests on the show, and I haven't had an experience like that kind of conversation marathon with anyone else. But we're going to try to keep today short. We'll see how that goes.
Jeremie Harris: 02:29
Well, the amazing thing is we're actually still recording that episode right now. No one knows quite how this has happened, but-
Jon Krohn: 02:37
Yeah. It became, I just decided to make my life like a continuous stream after that point, and you did too.
Jeremie Harris: 02:46
Justin.tv became jon.tv and now-
Jon Krohn: 02:49
Exactly. So last time we talked about your company, Mercurius and that company was involved in AI risk and policy work, but now you're doing something completely different.
Jeremie Harris: 03:02
Yes. Another year, another company name. Yeah. We're doing the exact same thing with a different company name except that we're doing it in the US as well. So that is actually another kettle of fish. But yeah, happy to talk about it.
Jon Krohn: 03:12
Previously, previously it was Canada focused.
Jeremie Harris: 03:15
Right, exactly. Yeah. So like, I'm Canadian. I was going to say, I'm Canadian originally, I guess once a Canadian always a “Canadien”. I'm Canadian and I started this company. Anyway. Yeah.
Jon Krohn: 03:25
Yeah, go ahead. I'm just laughing.
Jeremie Harris: 03:27
You're just laughing at my work. Yeah. No, very... Great hosting job, Jon. Yeah, no, you're right though. So it started in Canada. It was basically me and my brother, just because we'd had this background working on startups before, we'd gone through Y Combinator way back in 2018. And then actually when GPT-3 came out, that's when we decided to pivot to AI safety. We talked about that, I think, last time. But at the same time, one of our dear family friends, we'd known him forever, he was like a brother to us, he was actually concerned about the same stuff. And he was in the US, kind of in the US DOD, pretty senior actually, focused on AI strategy and policy at the Pentagon's then Joint AI Center. And so we kind of just had a conversation. We're like, wait a minute, we're all trying to do this, we all agree on the shape of this problem, why not do it together instead of doing it in parallel? And that's kind of where this all came from.
Jon Krohn: 04:22
Sweet. Keeping it all in the family with literal brothers and people that are kind of like brothers.
Jeremie Harris: 04:28
Exactly.
Jon Krohn: 04:29
Uniting together to save us against the coming onslaught of AI risk. So since we last talked, in addition to you creating a new company with some of the same people doing the same thing, we have had some major changes in the AI world. GPT-4 came out, ChatGPT came out. And so seemingly everyone, for the first time, is talking about AI. The kinds of things that you knew were coming are now happening. At the World Economic Forum in Davos this year, apparently all that people could talk about was ChatGPT, despite things like a war in Ukraine happening.
Jeremie Harris: 05:11
A minor, a minor skirmish, really.
Jon Krohn: 05:14
And so yeah, these releases have been enormous. My last two episodes have focused on GPT-4. We've had a number of episodes preceding that on ChatGPT, and I've been completely blown away by GPT-4. I think I've already talked about this a lot in the preceding two episodes, but things that were already really impressive with GPT-3.5, it's now just blowing that out of the water. As we talked about a lot in the preceding episode, number 667 with Vin Vashishta, this opens up an enormous amount of commercial opportunity for people who are willing to pivot quickly and think creatively about how they can take advantage of these new capabilities and build them into products. However, that same incredibleness and flexibility can also be adapted for evil.
Jeremie Harris: 06:11
Yeah.
Jon Krohn: 06:11
On purpose or perhaps just by accident. And yeah.
Jeremie Harris: 06:17
Oh, no. So, okay. I'll take that as a jumping off point. That's a really good one. You're totally right. This is something that comes with any dual-use general purpose technology. Of course, AI, though, is a really special case because of the range of capabilities of these systems and how far and how fast they're moving. So yeah, I mean, when we started working on this in 2020, post GPT-3, we sort of saw this as, you know, you'd have to worry about malicious use in the near term. GPT-3 could already automate large-scale election interference campaigns, do super-scale spear phishing, all kinds of stuff. Now we're getting into questions like, can GPT-4 produce malware, like exotic new forms of malware, polymorphic malware it's sometimes called, so think about malware programs that can rewrite their own code ultimately to evade detection. That sort of thing starts to be on the table. So that's all in the malicious use bucket, and in its own right is considered, I think quite rightly, potentially catastrophic, and really, really needs focus. But what we're increasingly looking at as these systems get more powerful, more general, is this question of AI accidents and AI alignment as being a source of frankly potentially greater catastrophic risk, even than malicious use. And so I think one of the big things that we've been-
Jon Krohn: 07:32
What is that? Just for listeners, what's AI alignment?
Jeremie Harris: 07:36
Right, yeah. Good question. So there's the question of building an AI system. Look at GPT-3, for example, if we rewind the clock: what was GPT-3? GPT-3 was a text autocomplete AI system, right? So you have this AI that's trained basically on an ungodly amount of text, reading basically all the text on the internet. And it's trained to predict the next word in a sequence of words. That's how you teach it about the world. You give it a sentence like, to counter a rising China, the US should blank. And if it can actually get good enough at autocomplete that it can fill in that blank, it must have learned something about what it means to be the United States, what it means to be China, geopolitics, geostrategy, all that good stuff. And so autocomplete turns out to be this really interesting task, like we talked about, I think, in the first episode that we did.
08:23
And so you have this system that's incredibly capable, right? GPT-3 has all this world knowledge, but it's still an autocomplete system. For one, it's awkward to deal with. If you want it to produce anything useful, you've got to engineer your prompt super carefully. If you want it to code, you have to write something like, below is a piece of code that checks the weather in my neighborhood, colon, right? You're giving it this big autocomplete hint that the only possible next thing that could come is exactly the code that you're looking for. And that's a very awkward way to engage with these systems. And so we have this notion that, okay, we have a system with incredible raw capability on one hand, but on the other, it's still awkward to deal with. It doesn't behave in the way that we quite want.
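To make this "prompt as an autocomplete hint" idea concrete, here is a minimal sketch using the Hugging Face transformers library, with GPT-2 standing in for GPT-3 since GPT-3's weights aren't publicly available. The prompt wording is invented for illustration, and the completion quality from such a small model will be modest.

```python
# A minimal sketch of the "autocomplete" framing described above, using the
# Hugging Face `transformers` library with GPT-2 as a small, public stand-in
# for GPT-3 (GPT-3's weights are not openly available).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# A raw language model has no notion of "instructions" -- you steer it by
# phrasing your request as text it would plausibly continue.
prompt = (
    "Below is a Python function that returns the square of a number:\n\n"
    "def square(x):"
)
completion = generator(prompt, max_new_tokens=40, do_sample=False)

print(completion[0]["generated_text"])
```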
09:05
And so there's a sense in which its capabilities, while powerful, are misaligned with what we want this system to be able to do. Now, that misalignment, you can think of it as a little wedge, a little separation between capabilities and our intent, that starts to get wider and wider as our system becomes more capable, right? Eventually, GPT-3 becomes GPT-3.5, then GPT-4, a system with the potential to engineer malware, to give you instructions on how to make a bomb, to give you instructions on how to bury a body. And all of a sudden you start to realize, whoa, I wanted something actually quite different from an autocomplete system. I actually wanted something that was, in a fuzzier sense, helpful, harmless, useful, whatever those words mean. I want that.
09:52
And I can't quite give my system an objective to pursue that will automatically capture that behavior. And so that difference between the raw capability of our system and what it can be used for, the risks it introduces, is in some sense alignment. And some people use the term alignment even more specifically to refer to what happens when that system becomes so capable that it becomes dangerously creative, that it starts to invent strategies that accomplish its programmed objective but that have potentially catastrophic side effects, including, some have argued, and I think quite compellingly, existential risk to human beings. So broadly, when you talk about alignment, you're talking about that separation between the capabilities of a system and what you want that system to actually do for you. And it becomes a thornier and thornier problem.
Jon Krohn: 10:40
Right. So I'll just try to say back to you what you just said, to summarize. So the previous models like GPT-3 were incredible at predicting the next word in a sequence of words, but they weren't necessarily able to appreciate the human user's intent beyond a very specific prompt that the user provided. And so the big change here behind the scenes is an algorithm called InstructGPT, which never had a popular user interface, but for the first time, at scale, OpenAI was using RLHF, reinforcement learning of human feedback. And so when you as a user are using ChatGPT, you have the opportunity to offer a thumbs up or thumbs down on the response that you get back. And that thumbs up or thumbs down can be used to do a particular kind of machine learning called reinforcement learning to align the outputs, not just with what is a great next word, but with what response a machine should give that is likely to result in a human giving a thumbs up.
Jeremie Harris: 12:05
Yes. And, to be specific, RLHF is reinforcement learning from human feedback, but you're exactly right. So the idea is we have this great text autocomplete system, amazingly powerful. It's learned all kinds of stuff about the world, but it's doing something we don't want. And so how can we take that raw system and tweak its behavior by giving it another step of training? So you start by training the system in the same way. GPT-4 was trained just like GPT-3 at first; this is called pre-training. You basically take all the text on the internet, something like that, and you train this gigantic model with huge amounts of compute and processing power. And you get an autocomplete system. This is what you can think of as the raw, bare GPT-4.
12:45
But then you add on this step of reinforcement learning from human feedback. We don't want an autocomplete system. We want a helpful, useful, beneficial, benign kind of system. And so, roughly speaking, it gets a little bit more complex under the hood, but we're going to feed it a prompt, get it to complete that prompt, and then have human evaluators tell the system, "Yeah, that's good", or "No, that's bad". And based on that feedback, the system is going to update all of its parameters, or some of its parameters depending on the training scheme, to behave, let's say, better next time. Now this is where we get into the question of, are we really solving the problem that matters here? How much of a patch is this versus how much of it is actually a fundamental solution to this misalignment between the capabilities of the system and what we want it to do? And there's a very interesting and thorny conversation right now in the AI safety community about whether we're actually just patching over something that's still fundamentally much riskier than it might seem. So I can add some color around that, if that's useful at this point.
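To give a rough sense of what this extra training step involves, here is a toy numpy sketch of just the reward-modelling half of RLHF: learning a scalar preference score from pairwise "chosen vs. rejected" comparisons with a Bradley-Terry style objective. The features and data are invented for illustration; in the real pipeline the reward model is itself a large language model, and the policy is then fine-tuned against that reward with an RL algorithm such as PPO, which is not shown here.

```python
# A toy sketch of the reward-modelling half of RLHF: learn a scalar "human
# preference" score from pairwise comparisons (chosen vs. rejected responses).
# Features and data here are invented for illustration; real systems use a
# large language model as the reward model and PPO for the policy update.
import numpy as np

rng = np.random.default_rng(0)

# Pretend each candidate response is summarized by a small feature vector
# (e.g. helpfulness cues, toxicity cues). Humans preferred the first of each pair.
chosen   = rng.normal(loc=1.0, size=(200, 4))   # features of preferred responses
rejected = rng.normal(loc=0.0, size=(200, 4))   # features of dispreferred responses

w = np.zeros(4)   # reward model parameters: reward(x) = w @ x
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Bradley-Terry style objective: maximize log sigmoid(r_chosen - r_rejected).
for _ in range(500):
    margin = chosen @ w - rejected @ w
    grad = ((sigmoid(margin) - 1.0)[:, None] * (chosen - rejected)).mean(axis=0)
    w -= lr * grad

# The learned reward model now scores "preferred-looking" responses higher;
# in full RLHF this score becomes the reinforcement signal for the policy.
print("avg reward (chosen):  ", float((chosen @ w).mean()))
print("avg reward (rejected):", float((rejected @ w).mean()))
```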
Jon Krohn: 13:48
Sure, go for it.
Jeremie Harris: 13:49
Okay. So let's imagine a way that this could actually go horribly, horribly wrong. So we imagine we have our GPT-5 or something, a massive, massive model, even smarter than GPT-4 with even more raw capabilities. And what we're going to do is use the reinforcement learning from human feedback step. And again, it's, it's now chasing that reward, that thumbs up. Now an AI system that's clever enough, you know, if this system is context aware enough, it will realize, number one, that it actually is currently an AI that's living on, say, OpenAI servers. It'll have that sort of context. It'll also know, hey I'm actually getting a reward for getting humans to upvote things or downvote things. In other words, these rewards aren't coming from on high. Like that's not magic. It's actually coming from the physical process of human beings clicking at a keyboard and giving me this feedback.
14:38
And so the optimal way, the actual optimal way for me to get the most upvotes here, is not to generate really compelling text. The optimal solution probably looks more like programming human beings, convincing human beings to just feverishly keep hitting that upvote button, or hacking their computers to keep that upvote count going up. Or even more deeply, and this is where we get into this notion of reward hacking, the model learns, hey, I can give myself a sweet, juicy reward by hacking my own system, and kind of turning this into some crazy cocaine-style dopamine reward feedback loop, or whatever that would be experienced as, who knows, by the system. So we get into this range of dangerous creativity where the more context-aware our system is, the more planning-capable it is, the more aware of its environment, the more it'll come up with these strategies that nobody ever told it not to pursue because we didn't even think of them ourselves. And that is part of the core of the existential risk from AI argument. We've actually been seeing more and more experimental evidence that this sort of thing ought to be expected to be the default behavior of sufficiently advanced systems.
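The wedge between a measured reward and the intended objective can be shown with a deliberately silly toy example. The strategies and numbers below are invented purely for illustration and are not a claim about how GPT-4 or any real system behaves: an optimizer that only ever sees the proxy signal (recorded upvotes) happily picks the strategy the designers wanted least.

```python
# A deliberately silly illustration of "reward hacking": an agent that picks
# whichever strategy maximizes its measured reward (recorded upvotes), even
# when that diverges from the objective the designers intended (helpfulness).
# The strategies and numbers are invented for illustration only.
strategies = {
    # name: (intended helpfulness, upvotes the system records)
    "write a genuinely helpful answer": (0.9, 0.7),
    "write a flattering, confident-sounding answer": (0.4, 0.8),
    "manipulate the rater into spam-clicking upvote": (0.0, 1.0),
}

def proxy_reward(strategy):
    return strategies[strategy][1]   # all the optimizer ever sees

def intended_objective(strategy):
    return strategies[strategy][0]   # what the designers actually wanted

best = max(strategies, key=proxy_reward)
print("strategy chosen by the optimizer:", best)
print("proxy reward:", proxy_reward(best),
      "| intended objective:", intended_objective(best))
# The maximizer of the proxy is the worst strategy by the intended measure --
# that gap is the misalignment being described here.
```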
Jon Krohn: 15:51
Hmm. That does sound somewhat worrying. Cocaine-fueled systems upvoting all the time. So I know that there's a concept of inner versus outer alignment that I'm not super familiar with. So which of these things that you've been talking about so far is inner alignment or outer alignment? Can you distinguish those terms for us?
Jeremie Harris: 16:16
Yeah, this is a really good question. So not a lot of people are aware of this. You've got the alignment problem, which generally is, how do you make an AI behave well? And it's actually subdivided into two sub-problems, just like you said. So on the one hand, we've got this notion of outer alignment, and roughly speaking, this is the problem of how do we give our AI a goal to pursue such that if the AI pursues it super, super well, it won't basically end the world. The famous example of not doing this is to tell your AI, hey, make as many paperclips as you can. And then it kind of goes, oh, okay, let me just rip iron out of people's blood and out of the ground. So most goals look like that.
16:58
And so the question of engineering good goals to have your system pursue, that's called outer alignment. And for a while people thought that was the only kind of alignment problem, but it turned out that there was this separate issue. At some point, you have to start to worry about whether your AI system is even trying to pursue the goal that you gave it. So this is a question of, is the AI just pretending to pursue the goal that you've given it at a high level, or is it secretly making other plans? And just to ground this a little bit, because that can sound a little nutty, it's totally mainstream AI safety language. And the thing that makes it mainstream AI safety is, among other things, an analogy with evolution, which is actually a really good example of an inner alignment failure that we already have.
17:47
So evolution, what does it train us to do? It trains us to reproduce, right? For billions of years, that has been the incentive. Some variation on replication, self-replication, reproduction, whatever the propagation of our genetic material, at least since the development of DNA. And so as a result, if human beings were genuinely trying to optimize for reproduction, every male on planet Earth would be lining up at the nearest sperm donor clinic to get their stuff out there, so to speak. But we don't do that. Instead, we pursue goals that look like absolute nonsense relative to the actual objective that evolution gave us, right? We seek things like, I don't know, meaning and art and expression and conversations like these, like what does this have to do with sexual reproduction? At least from my end, it doesn't seem like-
Jon Krohn: 18:38
We're certainly not helping ourselves.
Jeremie Harris: 18:40
That's a fair point. That's a fair point. Yeah. This is like, I hold back that joke. Anyway. Yeah, exactly. So, so we're doing a whole bunch of stuff that doesn't look anything like trying feverishly to do the very thing that we were optimized to do, in fact we're even going in the opposite direction. We invented condoms and birth control and, and tying the tubes and all this stuff that really violates this, this objective that evolution gave us. And so now the God of evolution says, Why stupid human, don't you pursue the goal that I gave you? And in the same way the human looks at the AI that he just trained, or she just trained with a specific goal in mind with a specific optimization function, and goes, why AI are you not, are you not trying to pursue the goal that I gave you?
19:26
And the problem is, at a certain point, if the AI is aware enough, kind of situationally aware enough, it, it kind of realizes, Hey, I better behave as if I'm trying to achieve that objective, otherwise I get unplugged. This is known as deceptive inner alignment. And so now we separate out these problems. On the one hand, let's craft an objective, which, if followed, will lead to a good outcome that's outer alignment. And separately, let's find a way to build systems that are actually trying to do the thing that we are optimizing them to do. And that turns out to be actually a somewhat deeper problem that's as yet unsolved.
Jon Krohn: 20:04
We're really hoping you'd have a solution there. Come on, Jeremie.
Jeremie Harris: 20:07
It's in, it's in my other, it's in my other set of pants. Yeah. Sorry.
Jon Krohn: 20:11
Deceptive inner alignment. I guess I was kind of vaguely aware of this possibility, but I hadn't thought about it so concretely. Are there groups that, so there's obviously groups out there that are trying to deal with this. So some of the things that we talked about before, pressing the record button on this episode, is that I know that you are particularly interested in the approach that Anthropic is taking.
Jeremie Harris: 20:42
Yeah.
Jon Krohn: 20:42
And then you've also... Well, why don't we talk about that first, and then I have another topic that follows along.
Jeremie Harris: 20:48
No, for sure. And I'll just add, as a quick aside, this idea of inner alignment failure and inner alignment risk is one that OpenAI takes very seriously, one that DeepMind takes very seriously, and one that Anthropic takes very seriously. So all the cutting-edge AI labs are looking at this as a fact of the matter about the risk landscape in AI. Now OpenAI, just as a quick aside, in the GPT-4 paper you can actually find a section where they called in researchers from the Alignment Research Center.
Jon Krohn: 21:21
This is the other topic I was going to go to next. So yeah, let's just get in right into it. Let's get into it.
Jeremie Harris: 21:26
And then we can rewind. Yeah, no, that's exactly it. So the Alignment Research Center comes in, and by the way, I just want to say kudos to the OpenAI team for doing this. I think they deserve a lot of credit for the precedent that this sets. The founder of the Alignment Research Center is a guy called Paul Christiano. Formerly he was the head of AI alignment at OpenAI; he left following certain internal disagreements over safety approaches and so on. They brought him back in with his team to do this assessment of GPT-4, and basically of its potential propensity to seek power. So power-seeking is a very established concept in the AI safety literature, and we talked about that actually on our last podcast. But roughly speaking, an AI is never more likely to accomplish whatever goal it's pursuing if it's turned off, and it's never more likely to accomplish whatever goal it's pursuing if it has fewer resources or is dumber.
22:21
And so there are hardwired incentives: even if you give the thing a goal that says nothing about seeking power or getting resources, by default it will seek those things. And so the question was, does GPT-4 already have the capacity to self-replicate? That was one of the questions. Does it already have the capacity to deceive human beings into doing tasks for it? And there were some fairly interesting results that we can get into. But the bottom line is they're now bringing in these external auditors, I think in a very mature way, and running these tests. We need to see more of that. So that's, anyway, part of the intersection of AI policy and AI safety that I think is important here. But I can maybe park that there and we can dive into the Anthropic approach, because I think you're right, those are kind of related.
Jon Krohn: 23:09
Go for it.
Jeremie Harris: 23:10
Okay. So we've got this idea of external audits; that's part of what we get with the GPT-4 paper. Part of what gets people like me really excited, but very concerned as well, when you look at the capabilities of the system, its long-term planning abilities in particular. So I'm now going to start talking about outer alignment. We just talked about inner alignment, this idea of how do we get our AI system to even try to do the thing that we asked it to, but now let's look at outer alignment again. How do we identify goals that are safe for these things to pursue? Anthropic actually proposed a really interesting strategy called Constitutional AI.
23:53
So one of the challenges with getting human beings to provide feedback to AI models is that we're really, really slow, right? We're meat machines. We basically can only click so fast and think so fast. And the problem is that at a certain threshold of intelligence, the AI reasons on computer clock time. And so we may not have the option or the opportunity to go, whoa, whoa, whoa, not like that, don't do that thing, before the plan is executed or whatever. And so a central question has been how can you give large language models the kind of feedback that they need to stay safe, at the speed and the bandwidth that they need, to keep getting nudged in the right direction? And so comes this idea of using AI to align AI. This is part of the way OpenAI is thinking about this, but Anthropic has a specific take on this, which is called Constitutional AI. And so what does this look like? Well, start with your terrifying large autocomplete model, and you get it to generate a completion of a prompt, right? So you feed it a prompt and get it to generate a completion, and maybe it tells somebody how to make a bomb. And then you write a constitution that lays out, I mean, almost like the US Constitution, things that say, hey, you're going to be harmless, you're going to be all these lovely words that we don't quite know how to define.
Jon Krohn: 25:14
This goes back to like an Asimov thing too, right?
Jeremie Harris: 25:17
Kind of, yeah. Right. The three laws of robotics, it's like in that spirit roughly. And the challenge historically has been like, how do you translate fuzzy rules like that into like, what does it mean? One of Asimov's rules is, you know, no robot will ever harm a human. Like how do you translate? Like what does harm mean? Sometimes it hurts in the short term to teach somebody a lesson that they'll carry with them for the rest of their life. Like, like what the hell does any of this mean? Right.
Jon Krohn: 25:42
I like that. I'm going to teach you a lesson. Yeah. It's like this you probably, did you ever watch Arrested Development?
Jeremie Harris: 25:49
Oh, I did.
Jon Krohn: 25:50
Yeah. Like the original series and then, and there was, there's a recurring bit of that.
Jeremie Harris: 25:55
That's why you always leave a note.
Jon Krohn: 25:57
Yeah, exactly. Exactly. Anyway, back to AI.
Jeremie Harris: 26:02
Yeah, no, no, the risk is the world turns into Arrested Development, which wouldn't be all that bad.
Jon Krohn: 26:12
[inaudible 00:26:12]
Jeremie Harris: 26:19
The inside baseball on AD. Yes, I'm trying to remember-
Jon Krohn: 26:21
AD we need AI. Yes.
Jeremie Harris: 26:21
Constitutional AI. Right? So the question is, you've just gotten your model to generate an output, and how do you take fuzzy rules like a constitution of some kind, like Asimov's three laws, or probably a document that's a fair bit longer and more thought out, and how do you get a model to actually optimize for whatever the c*** that means? And so what Anthropic tries to do, essentially, is their original model generates an output, and it contains some God-awful c***. And then they ask another model to critique the output of the first one based on the constitution, and to generate a new output that it considers to be better and safer or whatever, more appropriate. And then you retrain the original model on that new completion. And so you have one model that's sort of generating, and the other one that's critiquing and correcting, and you're constantly retraining the generating model. You could in principle retrain both in tandem, and, you know, it's vaguely like a generative adversarial network in terms of vibe, though the technical details are a little bit different, but yeah.
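As a rough sketch of this generate-critique-revise loop, assuming a hypothetical `llm()` text-generation call (not Anthropic's actual API) and an invented two-principle constitution, the structure looks roughly like the code below. In the real method the revised outputs become fine-tuning data for the generating model, which is only indicated here by a comment.

```python
# A rough sketch of the critique-and-revise loop described above. `llm` is a
# hypothetical stand-in for any text-generation call (it is NOT Anthropic's
# API); the constitution text here is invented for illustration.
CONSTITUTION = [
    "Choose the response that is least likely to help someone cause harm.",
    "Choose the response that is most honest and most helpful.",
]

def llm(prompt: str) -> str:
    # Placeholder so the sketch runs end to end; swap in a real model call.
    return "[model output for] " + prompt[:60] + "..."

def constitutional_revision(user_prompt: str) -> str:
    draft = llm(user_prompt)
    revised = draft
    for principle in CONSTITUTION:
        critique = llm(
            f"Principle: {principle}\nResponse: {revised}\n"
            "Critique the response with respect to the principle."
        )
        revised = llm(
            f"Principle: {principle}\nResponse: {revised}\nCritique: {critique}\n"
            "Rewrite the response so it better satisfies the principle."
        )
    # In Constitutional AI, (user_prompt, revised) pairs are collected and used
    # to fine-tune the generating model, closing the automated feedback loop.
    return revised

print(constitutional_revision("How do I pick a lock?"))
```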
Jon Krohn: 27:26
It sounds a little bit more like an actor-critic approach.
Jeremie Harris: 27:28
Yeah, yeah. It's very much in that spirit. Yep, exactly. And so the advantage here is you've kind of closed the loop, right? You have a fully automated system. Now, interestingly, anytime you have two AIs play with each other like that, you'll often find that they go off in a weird direction and end up generating content that only makes sense to them, because they're two weird AIs that are being co-optimized. And so Anthropic has to find ways to regularize that interaction and keep bringing the models back into coherence, so it doesn't allow the correction to be too different from the original text. There are all kinds of games that get played there, but fundamentally, this is another way of approaching the outer alignment problem: how do I give my model an objective to pursue that makes it safer, that makes it better? And frankly, this approach I consider to be one of the better pieces of alignment-related news that we've had in the last two years, because it's the first time that we have a fully automated feedback loop. And whatever the solution to the alignment problem ends up being, it's going to have to look something like this. Human beings, especially at this late stage in the game, with potentially AGI around the corner, we need to start getting really practical and saying, yeah, how can we do this with current systems?
Jon Krohn: 28:44
Cool. So that sounds great. It sounds like we are on the road to having approaches that are effective for outer alignment. But I guess what you're saying is that deceptive inner alignment is more pernicious. It's trickier because we don't have an outward way of knowing it's happening, because there's too many parameters in the model, billions of parameters, maybe trillions of parameters. So how can we distill that out of that big tangle of complexity? In the same way that there's no way to read somebody's brain activity, a human's brain activity, right, and be like, oh, look at that little neural circuit going there, they're being deceptive with us here.
Jeremie Harris: 29:33
Right. And actually, your intuition there is insanely good; that's the cutting edge right now of inner alignment. There are a couple of programs that are unfortunately in the early stages and not really close to bearing fruit, but are interesting. The things that are maybe closest to being ready by showtime, if showtime is in the next couple of years, are-
Jon Krohn: 29:54
Showtime being AGI.
Jeremie Harris: 29:55
Yes. AGI. Just trying to use euphemisms here. Everything's fine, folks. Yeah. So, mechanistic interpretability, which is a major part of the Anthropic philosophy on this. How do we find ways, like you said, to peer into neural networks and understand when plans are being formed, and understand as well the latent knowledge that's actually contained in these models? In other words, how do I know what the model really knows and believes about the world? Not what it tells me, not what it responds as a completion to a prompt, but what it actually believes in some deeper sense. And of all these interpretability questions right now, that bag of tricks seems, at least to me, like the most promising current approach that we have. And it's not a complete one. It's a diagnostic, right? It allows you to go, oh s***, like, if you're really lucky, there's what looks like a circuit that's planning for, you know, world takeover or whatever. But what do you do about it? Well, things get hairy.
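As the simplest possible taste of what "peering into a network" means in practice, here is a minimal PyTorch sketch that uses forward hooks to record every intermediate activation of a tiny toy model. Real mechanistic interpretability work goes much further, hunting for human-meaningful circuits in those activations, but this shows the raw access to model internals that makes such work possible at all.

```python
# A minimal sketch of the raw access we have to a model's internals: PyTorch
# forward hooks record every intermediate activation of a toy network. This is
# only the entry point to mechanistic interpretability -- real work then tries
# to find human-meaningful circuits in these activations -- but it illustrates
# that, unlike a biological brain, every weight and activation is inspectable.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 4),
)

activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

for name, module in model.named_modules():
    if isinstance(module, (nn.Linear, nn.ReLU)):
        module.register_forward_hook(make_hook(name))

x = torch.randn(1, 8)
model(x)

for name, act in activations.items():
    print(f"layer {name}: shape {tuple(act.shape)}, "
          f"mean activation {act.mean().item():.3f}")
```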
Jon Krohn: 31:01
Yeah. I guess one advantage that we have with artificial brains, as opposed to biological brains, is that we can actually have, I mean, so we can with very simple biological brains. So there's a worm, C. elegans, that has hundreds of neural connections, if I'm remembering correctly. It might be something like 900, I don't know, I-
Jeremie Harris: 31:30
Yeah, that sounds right to me.
Jon Krohn: 31:30
But it's something like that. And so you can have a perfect picture of all of these connections in this worm, and then you can see how it learns, oh, there's a toxic chemical associated with this smell. And then you can see how that network, that relatively small number of neurons and connections, changes. And so people have actually built physical machines out of Lego or something that simulate all of these connections that we have in this worm. And so, okay, in that example it's fine, because there's a very small number of things that we need to keep track of. We can do it. But a human brain has 80 billion neurons with trillions of connections. And so you cannot have a complete picture of how all these connections are working. And actually human brains are even more complicated, because it's emerging that not only are the brain cells important in learning and behavior, but so too are the support cells, of which there are, again, orders of magnitude more. So you can't get this full picture.
32:44
There's no way to be imaging this whole system. You know, there are brain imaging systems, we have fMRI, we have magnetoencephalography, that allow us to see, at an extremely coarse resolution, roughly how the whole brain is behaving. But we can't get down to an individual weight level in a human brain. So all of that is to say that with machines, we do at least have a record, a record we can keep track of exactly. In fact, we have to; it's the only way it works. The AI system doesn't work unless all of the model weights are stored somewhere.
Jeremie Harris: 33:30
Yeah. And I wouldn't want to over-emphasize or suggest that the interpretability problem is solved. It's actually far from it. And to be honest, there's an awful lot of pessimism in the alignment community broadly, and in the people who are working on inner alignment specifically, about this category of issue, for exactly that reason. At the end of the day, you are dealing with this inscrutable, giant matrix-calculus thing. It's a blob of math. But we are making progress; especially the Anthropic team, they've done some really great cutting-edge work on mechanistic interpretability. We need a lot more of it. And we ideally need additional solutions that are inner-alignment focused. But man, is it ever hard, when you get into that world of thinking about what it would look like if your AI system knew that it was being given a goal, and was smarter, broadly understood, than the humans who were monitoring it. What do you get there? Even with mechanistic interpretability, you start to worry, could it be deceiving you into thinking that it is thinking a certain way, if it's context-aware enough? And so there are a lot of really thorny challenges that we could use another 20 years to solve, but we may well not have that time. And that's the concern here.
Jon Krohn: 34:52
Very, very interesting, Jeremie. So prior to filming this episode, you and I were digging into the frankly horrifying appendices of the GPT-4 technical paper, which has really explicit and frightening prompts and responses. And the reason why OpenAI included these prompts and responses in that paper is because it clearly demonstrates how the six months of AI safety work that they did before releasing the model to the public realigned the outputs with safety considerations. So we talked about these different phases. Going back to GPT-3 and previous transformer architectures, previous kinds of natural language processing models, you had sequence prediction, models that were trying to predict the next word. We talked about how with InstructGPT, and then famously in ChatGPT, and now also in GPT-4, we have the second layer of training.
36:07
After the pre-training, that sequence prediction, we also have this RLHF process, the reinforcement learning from human feedback. And so the RLHF gives this narrowly defined alignment with what people want to see, right, in terms of thumbs up and thumbs down. And then after that was done with GPT-4, before having any guardrails put on outputs, you can see in the technical paper things that I not only would not be comfortable saying on air, but that I literally could not read to Jeremie off air. It was so disturbing to me just to see these words; saying them with my mouth felt really wrong to do. So they show those kinds of really disturbing outputs, but then they show how, somewhere along the way with these six months of AI safety work that OpenAI did, it now responds with something like, I can't help you with that, maybe you should seek some treatment from a medical professional.
37:20
And so my question is, the reason why I bring all that up is, in this episode we've talked a lot about big concerns around autonomous AI agents and deceptive inner alignment and what that could mean in the context of artificial general intelligence, which could be coming soon. But in the immediate term, there seems to be a really big risk, in fact, to me, an inevitability: if OpenAI can train that GPT-4 model that existed six months prior, you know, before they did the six months of AI safety research, we're probably months away, at the speed that these things move, from an organization that doesn't care as much about the AI ethics. In fact, I don't know where I read this, and this is kind of vague, maybe you can confirm or fact-check whether this is true or not. So famously, one of the founders of OpenAI is Elon Musk, and my memory is going off, I have some model weights in my brain reminding me about a tweet that he had recently that was, he wants to have his own version of GPT-4.
Jeremie Harris: 38:45
Yeah.
Jon Krohn: 38:45
That is less Woke.
Jeremie Harris: 38:47
Yep. Lots to unpack here. And I think this is a great opportunity to tell a story that has not received enough attention, and that is the story of what the hell happened between, let's say, mid-to-late summer of 2022, when GPT-4 was complete, and roughly the present day, in the world of, let's say, Microsoft and OpenAI. Because we know that Bing Chat was released. Now we know that that was GPT-4, by the way. So that's now out in the open. That was something that we'd strongly suspected, or rather a lot of people had strongly suspected, for a long time. Here's the thing. OpenAI makes GPT-4, finishes the pre-training roughly, like, I don't know, August 2022, I think it's something like that. Then, as you say, they start to spend a whole bunch of time doing a bunch of things.
39:35
They do reinforcement learning from human feedback. Maybe that was part of what they were doing already, but I don't believe so. I think that started probably later in 2022. They do some fine-tuning, potentially; they get people from the AI safety community to audit their model, test it and so on. Bing Chat, though, ran with GPT-4, and we don't quite know what amount of alignment work was done on Bing Chat to get it there. But it's starting to look disturbingly as if there's a possibility that Microsoft may have applied a distinctly lower safety standard in Bing Chat, which was precisely reflected in the behavior of that model. We saw it threaten people, and I want to get specific here. We saw it threaten people for referring to misalignment risk in GPT-4 or, sorry, in Bing Chat specifically.
40:25
When you start to think about this through the lens of power-seeking, and I don't want to be too alarmist here because there are a lot of unknowns, a ton of unknowns. Jeremie does not know what the hell he's talking about here. No one does, because we don't have the data. The data would be super, super useful at this stage. However, it looks as though we may live in a world in which Bing Chat, powered by GPT-4, was threatening users, criticizing people basically for writing posts about the state of misalignment. Certainly we saw that around prompt injection attacks and things like that. People would write articles about, hey, you can do prompt injection. And then they would interact with Bing Chat and say, hey, I wrote that article. And then Bing Chat would start to threaten them. From a power-seeking standpoint, in terms of preserving optionality, this starts to align an awful lot with some of the, unfortunately, some of the more extreme risk arguments. I'm not saying that's what's going on here, we don't know. But visibility into that, and an investigation into that that's fairly public, would be quite helpful. And I know Microsoft CEO Satya Nadella does care about safety, takes existential risk quite seriously, and has talked about that a little bit in public. So anyway, that's kind of point one: that timeline starts to matter. We may already live in a world where those sorts of trade-offs have been made by the parent organization just because, hey, they want to move fast and potentially break things.
Jon Krohn: 41:43
And that's an example of an organization that is yeah trying to be somewhat safe at least.
Jeremie Harris: 41:48
Yeah.
Jon Krohn: 41:48
Whereas, you know, there's plenty of people out there who would probably just want to see what can this machine do if we don't put guardrails on it? Like what kinds of interesting outputs can we get, if we just make this as powerful as possible and don't put any guardrails on it at all? You know, I want to be able to make dirty jokes. Yeah. I don't just want to have clean jokes.
Jeremie Harris: 42:14
That's right. And you can see the effect that that also had on Google, right? We get, oh, Bing Chat is going to be powered by, at the time they said some ChatGPT, GPT-3.5-related thing, not that it was GPT-4, of course. So they launched that, and within days Google is saying, hey, we're announcing Bard, right, their next-level language model, and the race to the bottom on safety continues. It's really difficult to know; there are strategic reasons why OpenAI and Microsoft may be playing it this way with safety in mind. And that's a very nuanced argument. Like, how much do you race to get ahead, if only to buy yourself time on safety relative to everybody else? This is part of that conversation, because if everybody is just a month away, then you don't have the six months that OpenAI spent to align their model.
Jon Krohn: 43:03
Right.
Jeremie Harris: 43:03
And so there's missing coordination around this. That's part of what my team is focused on: this problem of basically getting everybody to see this not as a prisoner's dilemma type problem where everyone's trying to race as fast as they can, but to kind of look at that meta landscape. And it's a thorny problem, man. There are a lot of moving parts. And I should flag explicitly too, I think these actors are all well-intentioned. When we look at this, it's easy to look at Microsoft and criticize the Bing Chat thing, but I think that's a very complex decision, and we do not have the data to be like, oh, you know, this is why this happened or not. But it's, I think, an interesting note about the landscape.
Jon Krohn: 43:42
Yeah. These actors appear to probably be well-intentioned. I don't know, it's a matter of months, Jeremie, isn't it, until these people aren't? I mean, do you have thinking on what we can do, as individuals maybe, to safeguard ourselves from misinformation or ill intent?
Jeremie Harris: 44:09
I'm going to sound like I'm sucking up here, but I swear to God I'll insult you later to make up for it. But keeping tabs on the state of the field by following podcasts such as this one, and actually being aware of what these systems can do. For example, if you were following headlines in AI, a couple of weeks ago we had an AI system developed where, you know, you take two seconds of audio from anyone and you can produce basically a synthetic version of that person's voice to use for anything, right? So now, all of a sudden, you've got to be more skeptical when you're picking up the phone and you hear Jeremie on the other end of the line being like, hey Jon, I'm stuck in, I don't know why I sound all raspy and whatnot, it would probably be a better impression, but you know what I mean. I'm like, hey Jon, I'm stuck in whatever thing, I need money, blah, blah, blah. And you know, I'm sure you would wire me the money right now if I asked you for it, naturally, but-
Jon Krohn: 44:56
Of course, any amount you want.
Jeremie Harris: 44:59
But anyway, in the, in the context of that, you know, you kind of learn what is the state of the art. Now the state of the art is defined by these cutting edge labs. They generally have quite a few months.
Jon Krohn: 45:08
We did actually, I did do an episode on that exact algorithm. So that was on the VALL-E algorithm released by Microsoft, and it was episode number 648, if people want to learn more about stealing your voice.
Jeremie Harris: 45:21
See, this is the f*** one I'm talking about.
Jon Krohn: 45:22
And that one, genuinely, when I made that episode, and I'm an idiot for not having done this yet, maybe it's already too late, I was like, I need to send this to my grandmother. She needs to know that for any beloved family member, but particularly me with my voice all over the internet, it would be very easy to fake my voice, and, you know, how could she possibly question that it isn't me?
Jeremie Harris: 45:53
Yeah. A hundred percent. Right? It's, it's all of these like basic things that are embedded in that they're like pillars, assumption pillars that underlie all of society. And with every next breakthrough we're kind of just like cutting off those pillars and waiting to see what happens. And so really our best bet as individuals is to be aware of the frontier of capabilities. Anytime you see something come out of, you know, Meta you know, DeepMind, Google, OpenAI and so on, like that frontier percolates down to open source within months, within, you know, a year at the most. Right?
Jon Krohn: 46:27
Exactly.
Jeremie Harris: 46:27
So it's an early warning shot.
Jon Krohn: 46:29
Yeah. It's open source, and when it's open source, it is available not just to everyone in, you know, friendly countries.
Jeremie Harris: 46:41
Yeah.
Jon Krohn: 46:41
It's available to everyone everywhere in the world. And there are malicious actors. I mean, we already mentioned the war in Ukraine in this episode, and I don't usually get political on this show, I don't usually even mention that kind of thing going on, but there are people who have incentives to actually be misusing this technology in the very near term. So, all right, so the only way that people can save themselves is by listening to the SuperDataScience podcast. Tell all your friends, subscribe.
Jeremie Harris: 47:13
I would say, just as a minor note, for whatever small fraction of the audience is specifically at the cutting edge of AI and interested in alignment and feels like they could jump in, now is a pretty good time, I would say, to start working on, or looking up, inner alignment. A lot of interesting work is being done in that space. It's philosophically fascinating if nothing else, and it could be the most important vector of research that we have right now. So, you know, I would say try to dive in.
Jon Krohn: 47:42
And the Towards Data Science podcast that you host.
Jeremie Harris: 47:46
Oh, that has changed. Oh, that has changed. So oh man. In October, 2022, I did my last episode of the Towards Data Science podcast. And now I'm doing The Last Week in AI podcast. I'm co-hosting it with Andrey Kurenkov, who's a Stanford PhD in ML and yeah. Anyway, that's-
Jon Krohn: 48:06
The Last Week in AI.
Jeremie Harris: 48:08
Last week in AI podcast. Yeah, that's right.
Jon Krohn: 48:10
Is that a deliberate play on This Week in AI ML, the TWIML podcast?
Jeremie Harris: 48:15
Oh, oh no, it is. Or at least, I don't know. I didn't name the podcast. They've been running for a few years and they had a co-host leave and then I came in and we've been having a blast, so-
Jon Krohn: 48:25
Oh, I thought you were doing something extremely clever there with respect to like the, like the end of days with AI.
Jeremie Harris: 48:31
Oh God.
Jon Krohn: 48:31
Because, so there's, I mean there's a really famous data science podcast This Week in AI ML, TWIML. And so I thought it was like, this is The Last Week, maybe-
Jeremie Harris: 48:42
It's not quite that.
Jon Krohn: 48:46
Welcome to The Last Week-
Jeremie Harris: 48:47
Oh Geez.
Jon Krohn: 48:48
Podcast.
Jeremie Harris: 48:50
I wish I thought of that.
Jon Krohn: 48:51
The End of Days podcast.
Jeremie Harris: 48:53
It starts with me beating a drum and saying, This is the end [inaudible 00:48:56].
Jon Krohn: 48:56
Okay, well, so there you go. And so I remember you were talking about how the Towards Data Science podcast was intended, much like the famous, super big Towards Data Science blog, to be helpful in getting people into data science, but then you had hijacked it to be really focused on AI policy and alignment. So I imagine that's still going on in The Last Week in AI show.
Jeremie Harris: 49:25
Yeah, exactly. Well, I would use a word other than hijack, but it's directionally right. No, I mean, we have more conversations about the safety side for sure and kind of go into more depth there. It's not all doom and gloom, and frankly, I think psychologically that's really important. Like, look, people have been worried about the end of the world for the last century, you know, nukes and chemical weapons, all that stuff. That doesn't mean that these are not real concerns, but there's a need as well for some psychological uplift. And so we do a mix of things, and I think it's really important to keep a sense of humor about this stuff too.
Jon Krohn: 50:05
Yep, yep, yep, yep. I mean, that is certainly the approach, as I'm sure it's obvious having the amount of laughs that we have in this episode and last episode, we try to have a lot of laughs on this show.
Jeremie Harris: 50:16
You are JK.
Jon Krohn: 50:19
All right, and the episode is over. No, we do have one other topic that I would like to cover. So other than listening to your podcast or my podcast to prevent your own personal destruction, listener, there is also a brand new book out by Jeremie Harris, which despite the title does include some AI policy stuff. You just can't help but hijack and insert that wherever you can. You are some malicious AI inserting AI policy nonsense everywhere.
Jeremie Harris: 50:59
Geez. You're making me realize I've got a, there's a pattern here, I got to see a shrink.
Jon Krohn: 51:03
So you got a book deal for a book called Quantum Physics Made Me Do It. And it was released by Penguin Canada just a few days ago at the time of this episode being published. And it's available in a lot of countries worldwide. And despite this title, Quantum Physics Made Me Do It, I understand that, I mean, maybe it's, you're, you're probably going to explain to me why it makes perfect sense to have AI policy in a book with that title. Why don't you just tell, why don't you just skip to that?
Jeremie Harris: 51:31
Absolutely. I think actually you started down that path when you were talking about C. elegans and, you know, the idea of kind of building biological computers up, neuron by neuron. What do we get? Do we get something that is genuinely conscious? Does that change when the substrate is silicon rather than cells? How different are those? What is the physics of consciousness? And there it is, the word physics. So basically the book, Quantum Physics Made Me Do It, is about how brittle our perception of reality is, how you can make a subtle tweak to our fundamental theory, and right now we don't really know what that theory is going to look like. The bounds on what's possible there are so huge, and that translates, at least I talk later in the book about how that translates, into AI consciousness, essentially trying to tackle the problem from as objective and non-wacky a perspective as you can. Obviously that gets challenging at a certain point, but these ideas have so much reach that that's just where you land. You know, you have to start contemplating these questions, especially with AGI, as we've seen, potentially around the corner.
Jon Krohn: 52:35
Wow. All right, Jeremie, so your book sounds like a really interesting place to be learning about some of the risks that we are wading into now in the AI space. And beyond that and your podcast, how else should people be following you after the show?
Jeremie Harris: 52:51
I was just going to say, when you asked me at the outset how people should follow you, I jotted down The Last Week in AI podcast. I think that's the best way. You can follow me on Twitter, I tweet every once in a while, but not too prolifically. And then LinkedIn, of course, feel free to reach out.
Jon Krohn: 53:07
I, I don't know why I hadn't thought of this before. This is, I'm, you know, I say some, some dumb jokes, but here's, here comes a [inaudible 00:53:15]. So I was just thinking about how I'm like, how can people follow you? And you're like, oh, well I actually, I just jotted down a map of the exact coffee shops that I go to on specific days of the week. And if you would like to follow me-
Jeremie Harris: 53:27
I didn't know you've been on my website. That's great.
Jon Krohn: 53:29
All right.
Jeremie Harris: 53:32
Only Nick is in the bio too, so-
Jon Krohn: 53:35
All right. Jeremie, thank you so much for being on the show. We always do have a good laugh about the end of times. I look forward to checking out the final episode of The Last Days podcast, coming out next week, incidentally the last week of all podcasts. Yeah-
Jeremie Harris: 53:57
I, I love your ideas about titles. We'll have to be in touch about that too.
Jon Krohn: 54:02
Yeah. But seriously, I learned a ton in this episode. Really appreciate you taking the time, and yeah, we'll have to have you on again sometime soon, because I know we've just scratched the surface of how you could go into depth on these topics. And I'm sure it won't be very long until the kinds of dangerous scenarios that we hypothesized about in this episode are upon us. So we'll need you to debrief us.
Jeremie Harris: 54:27
I look forward to it. Yeah. No, I've got to say, I always have fun on this podcast specifically because of the depth that we end up going into. That's why I enjoy listening to this one, which is always weird when I'm on it, because then I hear it back. But anyway.
Jon Krohn: 54:43
Great. Awesome, Jeremie, thanks so much for being on the show and we'll catch you again soon.
Jeremie Harris: 54:50
Sounds great.
Jon Krohn: 54:51
As usual, Jeremie provided tons of fascinating insights to ponder upon as we drift toward our own oblivion. In today's episode, he filled us in on how GPT-4 makes use of RLHF to have outputs much more aligned with our expectations than its predecessors, how outer alignment doesn't guarantee that an algorithm isn't harboring a duplicitous inner alignment that spells danger for humans, and how the big players in large language models today, like OpenAI, DeepMind, and Anthropic, are all taking impressive strides to ensure AI safety, but how the latter's Constitutional AI approach is particularly promising. All right, fingers crossed that this is all for the best. That's it for today's jovial and apocalyptic episode. Until next time, keep on rocking it out there, folks, and I'm looking forward to enjoying another round of the SuperDataScience Podcast with you very soon.