81 minutes
SDS 701: Generative A.I. without the Privacy Risks (with Prof. Raluca Ada Popa)
Subscribe on Website, Apple Podcasts, Spotify, Stitcher Radio or TuneIn
Confidential computing, securely using generative AI APIs, and the ongoing debate of open-source versus closed-source AI development are top topics of discussion this week as we welcome Dr. Raluca Ada Popa, renowned computer scientist, entrepreneur, and President of Opaque Systems, to the podcast. Together with Jon Krohn, she talks about the benefits and complexities of operating compute pipelines across multiple cloud providers.
Thanks to our Sponsors:
Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.
About Raluca Ada Popa
Raluca Ada Popa is an Associate Professor of Computer Science at UC Berkeley, where she specializes in computer security and applied cryptography; her papers have been cited over 10,000 times. She is Co-Founder and President of Opaque Systems, a confidential computing platform that has raised over $31m in venture capital to enable collaborative analytics and A.I., including allowing you to securely interact with Generative A.I. platforms. Raluca previously co-founded PreVeil, a now-well-established company that provides end-to-end document and message encryption to over 500 clients. She holds a PhD in Computer Science from MIT.
Overview
Large Language Models have been in the spotlight lately for their rapid innovation, but the data privacy and security issues that come with these tools are sparking equally important conversations. Here to shed light on this crucial aspect of the rapidly expanding developments in deep learning is Dr. Raluca Ada Popa. With her expertise in data privacy and security, particularly concerning Large Language Models (LLMs), she provides valuable insights as we navigate this exciting time in technology.
Raluca explains the fundamentals of confidential computing and its value proposition in terms of performance. She talks about performing inference with an LLM, or even training one, all while preserving data privacy: not even the LLM developer can access the data, which makes this a major step forward for privacy. Raluca goes on to cover several eye-opening topics, but the real highlight of the episode is her explanation of how hardware enclaves on CPUs and GPUs can secure data for analytics, inference, or model training. This technology prevents anyone, even the model owner, from viewing unencrypted data.
She also touches on how to use commercial generative AI APIs, like OpenAI's GPT-4, without risking exposure of sensitive or personally-identifiable information included in API queries, and explains the benefits of Opaque Systems, a confidential computing platform that has raised over $31m in venture capital to enable collaborative analytics and A.I., of which she is Co-Founder and President.
What’s more, she discusses the pros and cons of open-source versus closed-source AI development, providing valuable insights into both domains. Raluca advocates for an open-source AI development model, as it could potentially lead to better security measures, an aspect critical for technology handling vast volumes of data. She also delves into how Berkeley's SkyLab is facilitating efficient compute pipelines across multiple clouds for confidential computing.
Raluca's insights and experiences make this episode a must-listen for anyone interested in data privacy, AI development, and the blend of academia with entrepreneurship.
In this episode you will learn:
- What is a confidential computing platform? [04:31]
- How to get started with confidential computing [12:10]
- The challenges of confidential computing and LLMs [21:11]
- How to safeguard your data while using commercial LLMs like GPT-4 [38:00]
- Open-source vs closed-source [52:28]
- Raluca's PreVeil cybersecurity company [1:01:50]
- Combining entrepreneurship and academic career [1:04:03]
- DARE Program [1:10:39]
Items mentioned in this podcast:
- AWS Trainium
- AWS Inferentia
- Modelbit
- ODSC East
- Opaque Systems
- Berkeley SkyLab
- Visor
- UC Berkeley's DARE
- Vicuna
- SDS 670: LLaMA: GPT-3 performance, 10x smaller
- SDS 672: Open-source “ChatGPT”: Alpaca, Vicuña, GPT4All-J, and Dolly 2.0
- SDS 674: Parameter-Efficient Fine-Tuning of LLMs using LoRA (Low-Rank Adaptation)
- Midjourney
- SDS 609: Data Mesh
- PreVeil
- Symposium on Security and Privacy
Podcast Transcript
Jon Krohn: 00:00:07
This is episode number 701 with Dr. Raluca Ada Popa, Associate Professor at Berkeley and President of Opaque Systems. Today's episode is brought to you by AWS Cloud Computing Services and by Modelbit for deploying models in seconds.
00:00:19
Welcome to the SuperDataScience podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I'm your host, Jon Krohn. Thanks for joining me today. And now let's make the complex simple.
00:00:50
Welcome back to the SuperDataScience podcast. Today we are very fortunate indeed to have the renowned researcher and entrepreneur Raluca Ada Popa on the show. Raluca is an associate professor of computer science at UC Berkeley, where she specializes in computer security and applied cryptography. Her papers have been cited over 10,000 times. She's also Co-Founder and President of Opaque Systems, a confidential computing platform that has raised over 31 million in venture capital to enable collaborative analytics and AI, including allowing you to securely interact with generative AI platforms. She previously co-founded PreVeil, a now well-established company that provides end-to-end document encryption to over 500 clients. She holds a PhD in Computer Science from MIT. Despite Raluca being such a deep expert, she does such a stellar job of communicating complex concepts simply that today's episode should appeal to anyone who wants to dig into the thorny issues around data privacy and security associated with large language models and how to resolve these issues.
00:01:50
In the episode, Raluca details what confidential computing is and how to do it without sacrificing performance. She talks about how you can perform inference with an LLM or even train an LLM without anyone, including the LLM developer, being able to access your data, and how you can use commercial generative AI APIs like OpenAI's GPT-4 without OpenAI being able to see sensitive or personally identifiable information you include in your API query. She talks about the pros and cons of open-source versus closed-source AI development. She talks about how and why you might want to seamlessly run your compute pipelines across multiple cloud providers, and she fills us in on why you should consider a career that blends both academia and entrepreneurship. All right, you're ready for this awesome episode. Let's go.
00:02:40
Raluca, welcome to the SuperDataScience podcast. Thank you for coming on the show. How you doing today?
Raluca Ada Popa: 00:02:45
I'm doing great. Thank you for having me. Excited to be here.
Jon Krohn: 00:02:49
Yeah. And so where are you calling in from today? We met in Boston at ODSC East. And I know that you spent a lot of your life in the Boston area.
Raluca Ada Popa: 00:03:01
Yes, I did.
Jon Krohn: 00:03:02
But now I think you're on the West Coast.
Raluca Ada Popa: 00:03:04
Yes, yes, I'm in the Bay Area right now, but I did my education at MIT in Boston, four degrees. So, a long time.
Jon Krohn: 00:03:12
And perfect GPAs all along the way.
Raluca Ada Popa: 00:03:17
Yes.
Jon Krohn: 00:03:20
And so congrats. At the time of recording, you recently wrapped up the Confidential Computing Conference which I guess was, was that in San Francisco?
Raluca Ada Popa: 00:03:30
Yes, it was also in San Francisco. Absolutely. It was last week. Yeah, it was a very exciting conference. It's the very first in the confidential computing space, the very first in-person conference. It's, as you know, a very new area, disruptive technology. And so it was really amazing having this large, growing, fast growing community come together in person.
Jon Krohn: 00:03:54
Yeah. My understanding is that it was oversubscribed. And so the one day conference this year, you'll be expanding to a two day conference in 2024.
Raluca Ada Popa: 00:04:01
Yes, absolutely. We were targeting for, since it's the very first event and it's in person, and we wanted to really make sure the, you know, the attendees are extremely relevant to the conference. We were targeting up to, let's say 150 to start with, and we got 250. And what was interesting was that almost everybody showed up. So, there wasn't much of a, you know, over registration, not showing up. Almost everybody picked their badges up, so.
Jon Krohn: 00:04:28
That's amazing. Congrats. So in addition to running very popular conferences, you're mostly consumed, I imagine, by being President and Co-Founder of Opaque Systems, which is a confidential computing platform for collaborative analytics and AI at scale. And we're going to dig into what will probably be the most interesting aspect of that for our listeners shortly, which is related to generative AI and LLMs and all of the kinds of privacy issues that are associated with that. But first, let's dig into what confidential computing is in general. So, tell us what a confidential computing platform is and how it works.
Raluca Ada Popa: 00:05:08
Absolutely. So, the idea is that you can compute in the cloud, let's say you can do analytics or machine learning, but while keeping your data encrypted. So, there's a lot to unpack there. So, the idea is that your data, confidential data, will be protected by encryption, but at the same time, you can still do machine learning and data analytics on it. The cloud cannot see your data, so you can move your sensitive workloads into the cloud. Cloud employees cannot see it, and hackers breaking into the cloud, even if they get access to the machine, they can't see your data because it's in encrypted form at all times during processing.
Jon Krohn: 00:05:48
That's really interesting. So, how is it that it's encrypted all of the time in processing, but yet we could still do analytics on it or we could train a machine learning model on it?
Raluca Ada Popa: 00:05:59
Yeah, yeah, very good question. So, I've done research in this space, you know, at UC Berkeley, in my PhD at MIT, and now, you know, taking it to the company. And there's really two main approaches. One is using some advanced cryptographic mechanism that can actually compute on the data. And that one works in some settings, but it's still very slow for machine learning training and analytics. There's the other approach, the more recent one, confidential computing, which is based on hardware enclaves. So, we're talking about specialized hardware that is now available in the cloud, which is awesome. And the idea is that this hardware enclave is in the processor, in the physical chip, and it creates like a hardware box. Now in that box, the data actually gets decrypted, but only the metal can see the data. There's no human physically in that CPU box. Yeah. And as soon as it goes outside of the CPU box, basically outside of the die of the processor, there's an encryption key that's fused into the hardware that encrypts the data right before it comes out. So, outside of this box and outside of the die, the data is encrypted on the bus, in memory. So, anywhere a human can access and look at it, it's in encrypted form.
Jon Krohn: 00:07:17
This is a hardware solution, not a software solution.
Raluca Ada Popa: 00:07:20
Very good point. Yeah. Yeah. So, the security and the trust comes, full stop, from the specialized hardware, but then you need the software stack on top of it, which is what Opaque provides, to be able to work with this hardware. Because to work with this hardware requires a lot of expertise, like expertise on how do you manage the keys for this hardware. There's something called attestation, where you can check that the hardware deployment was done correctly while being remote. There's things like ensuring end-to-end policies on who can see the result of, let's say, the model you trained or the prediction results. Then there's the scale, how do you scale with this hardware enclave? So, there's this big software stack that you need to run on top of this hardware in the cloud if you want it to be frictionless. So, the whole idea for Opaque is to provide the software stack so that data scientists, you know, can use this without any friction. They don't have to think of confidential computing. They don't have to think of how hardware enclaves work. They can just use the software stack and run their usual pipelines unchanged, and then they don't have to think about all the complexity.
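To make the attestation step Raluca mentions a little more concrete, here is a minimal Python sketch of the general idea: a client refuses to release its data (or data keys) to an enclave until it has checked a signed "measurement" of the code loaded inside. The report format, the HMAC-based signature, and the function names are all hypothetical simplifications; real enclave stacks from Intel, AMD, and NVIDIA use their own report formats and certificate chains.

```python
import hashlib
import hmac

# Hypothetical measurement of the pipeline code we agreed should run in the enclave.
EXPECTED_MEASUREMENT = hashlib.sha256(b"pipeline-code-v1.2").hexdigest()

def verify_attestation(report: dict, vendor_key: bytes) -> bool:
    """Accept the enclave only if the report is authentically signed and the
    code measurement matches what we expect (toy stand-in for real attestation)."""
    payload = (report["measurement"] + report["nonce"]).encode()
    expected_sig = hmac.new(vendor_key, payload, hashlib.sha256).hexdigest()
    signature_ok = hmac.compare_digest(expected_sig, report["signature"])
    code_ok = report["measurement"] == EXPECTED_MEASUREMENT
    return signature_ok and code_ok

def send_encrypted_data(report: dict, vendor_key: bytes, ciphertext: bytes):
    """Only release the dataset (and its wrapped key) after attestation succeeds."""
    if not verify_attestation(report, vendor_key):
        raise RuntimeError("Attestation failed: refusing to release data key")
    # Here the client would wrap its data key to the enclave's public key
    # and upload the encrypted dataset.
    ...
```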
Jon Krohn: 00:08:26
Very cool. I think I'm starting to understand it. So, the enclave part of it, is that like the hardware specifically or it's kind of the whole system incorporating the software?
Raluca Ada Popa: 00:08:36
Yeah, very, very good question. So, yeah, the enclave is the hardware piece. It's the hardware enclave, it's specialized hardware provided by Intel, by AMD, by ARM. So, most of the major hardware vendors right now have a hardware enclave technology, and it comes shipped with the modern server architectures. So, for example, Intel Skylake has it. So if you look at AWS, if you look at, you know, GCP or Azure, they have these hardware enclaves already, you know, available in the data centers, and you can just choose VMs that have that enabled. But then to, you know, handle all this complexity and the expertise that you need, we have the software stack that Opaque provides, so the user doesn't have to think, doesn't have to worry, oh, am I using enclaves or not? How do I deal with the keys? How do I push my data in there, you know, how do I scale? They don't have to worry about any of this. They just run their usual system. So, it's like a synergy of software and hardware, but the hardware enclave itself just provides the hardware part.
Jon Krohn: 00:09:45
I gotcha. And so I imagine something else that the software is critical for is ensuring that this encryption is very efficient. Like I suspect that there's a risk here that if you have all of this encrypted data and you're training like a large language model, there's a huge amount of information in each round of training of stochastic gradient descent that is going to have to come out of encryption, and then have outputs be encrypted again, very rapidly. So, yeah, I'm guessing from your head nodding.
Raluca Ada Popa: 00:10:15
Absolutely. Yeah. Yeah, that's, that's absolutely a very good question. And it used to be that encryption would add a significant overhead in performance, but the nice thing is that right now this encryption is in hardware, it's called the AES-NI instruction, which encrypts extremely fast. So, it's almost never a bottleneck. Depending on your workload it can add, you know, a few percentage points of performance overhead, but really the main performance overhead comes from going in and out of the enclaves. So, you have to have a, you know, well-designed system. And that's another thing that, you know, our software stack helps with. But the other exciting thing, because you mentioned generative AI and you mentioned the large amount of data: we now have GPU enclaves thanks to NVIDIA. So, this hardware enclave used to be just on the CPU, and a lot of the solutions out there are for the CPU, but fresh off the press, NVIDIA is providing hardware enclaves in their Hopper architecture. And I'm not advertising NVIDIA, I'm just generally super excited about this. It's like, awesome. So, now we can actually run GenAI at performance that is similar to unencrypted, insecure computation.
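As a rough illustration of why the AES encryption itself is rarely the bottleneck, the sketch below times authenticated encryption of a dummy 64 MiB buffer with the Python `cryptography` library. On a machine with hardware AES support this typically runs at gigabytes per second; the buffer size and any numbers printed are purely illustrative and will vary by machine.

```python
import os
import time
from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # pip install cryptography

# Time AES-GCM over 64 MiB of dummy "training data" to show that raw crypto
# is fast; in enclave systems the dominant cost is usually crossing the
# enclave boundary, not the encryption itself.
key = AESGCM.generate_key(bit_length=256)
aes = AESGCM(key)
payload = os.urandom(64 * 1024 * 1024)

start = time.perf_counter()
ciphertext = aes.encrypt(os.urandom(12), payload, None)
elapsed = time.perf_counter() - start
print(f"Encrypted {len(payload) / 2**20:.0f} MiB in {elapsed:.3f}s "
      f"({len(payload) / 2**30 / elapsed:.2f} GiB/s)")
```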
Jon Krohn: 00:11:33
Nice. That is a very cool development. And yeah, as we know, GPUs are critical to being able to train. You can't be training LLMs, even the smallest LLMs without GPUs. It takes forever. So, that's, that's a big change. That's awesome.
Raluca Ada Popa: 00:11:52
Yeah, it's very exciting.
Jon Krohn: 00:11:54
And so, yeah, so it sounds like people can be using various cloud providers, all the major cloud providers provide these enclaves. Now NVIDIA has GPUs that allow for these enclaves. So, then how does somebody, if somebody's listening to the show right now and they're like, oh, this is essential for me. I need to be using those, these enclaves I need to be ensuring that all of my computing is happening confidentially. How does somebody do that? I, I'm guessing it's like using Opaque systems. I don't know, you tell me like, what's the full, like what do, do I go to Opaque Systems or do I go to AWS or like ... Yeah, how do I get started?
Raluca Ada Popa: 00:12:31
Yeah, that's, that's a very good point. I would say that especially for, you know, companies or users that don't have expertise in confidential computing, encrypted computation, the right answer is to go through Opaque, because it removes all this friction and all this expertise required. You don't have to think of how do you set up the enclaves? How do you move your compute into the enclaves? There's key management that has to happen. Then there's, you know, attesting the enclave environment. There's like a whole suite of, you know, things that require expertise. So, the right thing is to go to Opaque Systems, and we can help, you know, deploy your system and your pipeline in enclaves, so you have no friction. Now, if somebody has a lot of expertise and they say, oh, I really want to build my system from scratch, from the very beginning, using enclaves, then they could just, you know, start in a more raw manner directly on the clouds.
Jon Krohn: 00:13:27
This episode of SuperDataScience is brought to you by AWS Trainium and Inferentia, the ideal accelerators for generative AI. AWS Trainium and Inferentia chips are purpose-built by AWS to train and deploy large-scale models. Whether you are building with large language models or latent diffusion models, you no longer have to choose between optimizing performance or lowering costs. Learn more about how you can save up to 50% on training costs and up to 40% on inference costs with these high performance accelerators. We have all the links for getting started right away in the show notes. Awesome, now back to our show.
00:14:08
All right. Gotcha. So, people who already have a lot of experience, they could potentially be going directly to a cloud provider, or doing it maybe on their own in-house, on-prem cloud infrastructure. But for people who don't have experience with confidential computing, like me and probably a lot of our listeners, the fastest route to confidential computing is using a platform like your Opaque Systems. Very cool. Something that you mentioned associated with Opaque Systems is that it's for collaborative analytics and AI. So, what does that mean? How is that, like, encoded as part of this platform?
Raluca Ada Popa: 00:14:45
Yeah, very. That's, that's, that's a great question. I realize I'm saying that a lot, but you're asking good questions, so-
Jon Krohn: 00:14:51
Thanks.
Raluca Ada Popa: 00:14:53
So once you have this encrypted data processing and your data is secure and protected during computation, it's super natural to collaborate, and to collaborate with sensitive data. So, for example, there's so many scenarios where, let's say, banks want to collaborate to find money launderers, but without sharing data with each other, because they're obviously in competition. Or healthcare organizations want to collaborate, for example, to find, you know, Covid patterns or, you know, evaluate the spread of Covid, or, you know, find better diagnosis and treatment for cancer. But they cannot collaborate because, again, patient data is very confidential. So, it's super easy, once you have a confidential deployment like this in the cloud, for everybody to upload their sensitive confidential data there. And you can run the joint analysis, maybe train a model jointly or do, you know, aggregate statistics, without worrying that the other party is gonna see your data, or really anybody, not the cloud, not hackers.
00:15:59
And then you produce the final result and then they can all enjoy it. They can all enjoy, let's say, a better cancer prediction or treatment. So, it's a very, very natural, you know, thing to do, to collaborate with confidential data once you have a safe place to do it. And the way Opaque enables it is it takes care of all the key management associated, because each one of these parties, let's say each bank or hospital, is going to have their data encrypted under different keys. So, it's handling all that key management. Also, they're gonna want different policies. For example, a bank might say, oh, I'm only willing to reveal, you know, the result if, for example, its quality based on some testing dataset is good enough, only then am I gonna reveal the final model, or I'm only gonna reveal certain information about my data and nothing else. So, it helps ensure these end-to-end policies.
Jon Krohn: 00:16:53
Oh, cool. So, there's also, there's some level of discretion as to which data need to be private and other data that maybe can be shared. So it's not like everything that goes through a confidential computing system must always be encrypted all the time.
Raluca Ada Popa: 00:17:08
Yeah, exactly. So, some things might not need to be encrypted, but you can also encrypt everything just to keep it easy. And we actually do recommend that. But I think where it's important to have this discretion, this policy, is in the outcomes that you take outside of the enclave. So, for example, let's say that you and I want to train an aggregate model together because we think that we're gonna get a better model than training on our own data. But then I'm saying, "Hey, what if the model helps you more than it helps me? Maybe I'm not okay with that," right? So, we can set policies saying, let's test this model on some testing data, which we provide while staying encrypted, before we reveal it to the parties, and if it passes a certain accuracy that both are happy with, only then is it gonna come out of the enclave and be visible. So, you can set very, very advanced policies and yeah.
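The policy Raluca describes can be pictured with a small sketch like the one below (not Opaque's actual API): the jointly trained model is only allowed out of the enclave if it clears an accuracy threshold both parties agreed to in advance. The threshold value, the toy model, and the helper names are all hypothetical.

```python
AGREED_MIN_ACCURACY = 0.90  # hypothetical threshold both parties signed off on

def evaluate(model, data, labels) -> float:
    """Stand-in for an evaluation routine that runs entirely inside the enclave."""
    predictions = [model(x) for x in data]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def release_model(model, held_out_data, held_out_labels):
    """Return the model (i.e., let it leave the enclave in the clear) only if
    the jointly agreed policy is satisfied; otherwise nothing is revealed."""
    if evaluate(model, held_out_data, held_out_labels) >= AGREED_MIN_ACCURACY:
        return model
    return None

# Trivial example, purely for illustration.
model = lambda x: x > 0.5
print(release_model(model, [0.2, 0.9, 0.7], [False, True, True]))
```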
Jon Krohn: 00:18:04
That is super cool. That is not something that I like, that is well beyond the kind of questioning about this space that would've come to me organically. So, thank you for bringing that up. Yeah. That's, what kinds of scenarios are there like case studies where you could imagine that like practically that would happen? Like I guess maybe some different banks are collaborating on catching fraud or something and I don't know, some, it's like one bank provides most of the data and so I don't know, help me, help me, like make this example.
Raluca Ada Popa: 00:18:35
Yeah, absolutely. Absolutely. So I would say there's situations where, yeah, let's say for example, we want to catch money launderers, right? And the model, you know, indicates what are the suspicious accounts. Let's say maybe it says, oh, in this bank there's something like 20,000 or 30,000 suspicious accounts, or maybe a hundred thousand suspicious accounts. And then [inaudible 00:19:01] says, wait a minute, that's too much. You know, I don't want to reveal a hundred thousand of my customers' accounts, because it's too much of my data. It sounds like something is not right with this model. I would like a model that's much more performant and much more accurate, so I only have to reveal maybe, you know, 5,000 accounts that are really suspicious, something like this. So, you can kind of control how much data comes out from the model.
00:19:26
Other interesting cases that we're seeing are actually in Ad tech, where we see a bunch of users. You know, in Ad tech you can buy data sets that are proprietary, and they tend to be expensive, right? And then they help you with your own business, maybe user recommendations. But sometimes you want to know, "Hey, is this dataset gonna be good for me?" before you buy it. You know, is it gonna be useful or not before you buy it? The nice thing is that inside confidential computing you can test it and see if it's good for, you know, what you want to do or not.
Jon Krohn: 00:20:01
Oh, that's really cool. Yeah, that's a really great use case. That makes perfect sense. And again, something well out of what I would've thought about.
Raluca Ada Popa: 00:20:09
So, it's interesting that you only need to pay for, you basically only need to pay for a query, not for the whole data set, which is not something that's really possible today, right. You can just really pay for a query and see if you're happy with the outcome of that query, you know, how good the data set is for the specific query, and then you can maybe purchase the whole data set. But it's nice that the platform enforces the fact that you cannot get any more information about the data than you should.
Jon Krohn: 00:20:38
Yep, yep. Super cool use case. So all right, let's start to get into this new realm that everyone's excited about: generative AI, large language models. What are the specific considerations that we need to have, other than, you know, we already talked about this idea of efficient un-encryption and re-encryption, I guess, and it sounds like that's something that is, is a solved problem in the encryption space. You mentioned specifically that that isn't the bottleneck. So, what are the kinds of limitations or challenges associated with confidential computing when we're training big AI models, like the large language models that are everywhere now in generative AI?
Raluca Ada Popa: 00:21:21
Yeah, absolutely. Absolutely. And before I mention the challenges, I can mention a little bit the goals, what ideally would be amazing to have. We want the predictions, so the queries that users ask, you know, we want the users to have privacy, right? These queries could be, for example, uploading proprietary code, enterprise IP, like the case of Samsung, where an employee uploaded proprietary code to ChatGPT and basically revealed company secrets to ChatGPT in order to get some help to be more productive. But ideally we'd like that to be protected. So, with confidential computing, the nice thing is that the LLM provider can put the LLM model in an enclave in the cloud, and then users can send encrypted queries to the enclave.
00:22:14
And so the LLM provider cannot see those queries. There's a guarantee from the hardware, enforcement from the hardware and encryption. There's no way they can see what those queries are, yet they can still process the inference with high performance and respond. So, now all these organizations and their employees can use ChatGPT or all the other, you know, LLMs without being concerned the LLM providers will see proprietary company data. So, that's one really big goal. Another goal is, maybe if you are training from sensitive data sets and confidential data that the LLM providers shouldn't see, then you can actually do this training also in the enclaves.
00:22:54
So, what are the main challenges left? I would say that right now the confidential computing community is really well established around the CPUs. The GPUs are much newer, as is the Hopper architecture, which is much newer. They're just now becoming available in the cloud; they're about to be available in preview. So, it's super new in the cloud. So, I'd say there's still this incorporating the GPU enclaves in the cloud and then just optimizing these pipelines and workflows. It's something that we still need to, you know, work on and engineer. So, on the Opaque side, we've been doing a lot of this engineering and thinking, even outside of the cloud so far. But the nice thing is to have them in the cloud and really get the best performance that one can deliver. So, I'd say it's still a little bit of an unknown there, because this GPU enclave technology is very new. In the CPU world we know how things are very well, but we need GPUs here.
Jon Krohn: 00:23:52
So, in order to be able to get ahead of what consumers will need, does this mean that you at Opaque Systems potentially get access to like these new Hopper architectures and you get to like, try them out and figure out how you're gonna do confidential computing and have your pipelines and workflows in the cloud be efficient ahead of time before consumers have access to that, to that kind of chip?
Raluca Ada Popa: 00:24:14
Absolutely. And so with Opaque Systems, a lot of our technology was built on top of our open-source and research in our lab at UC Berkeley. And for example, this was even years before the NVIDIA enclaves became available. We had a partnership with Microsoft where we did research on how we do machine learning, neural networks and, and analytics on a GPU enclave architecture. It was all in simulation mode, but it still teaches you a lot about how to build the system on top of that. So, we have a few years' advance, I would say, because of that research at UC Berkeley.
Jon Krohn: 00:24:54
Can you dig into that a bit more? I find that really fascinating for some reason. So, maybe our listeners will too. So, I had never thought of this before. Oh, actually, I guess I had kind of in the realm of quantum computing, for example, you'll have, like I know that there are tools you can go and use online where you can see what it would be like. You have these simulations of what it would be like to be using quantum computing to solve a particular problem. So, you can kind of get used to the, to the feel of that, even if it doesn't have the actual performance benefits of quantum computing. So maybe this is kind of similar to that, but, but that's, I I admit it had never occurred to me that there would be other kinds of applications. So, I'd love to hear more about it.
Raluca Ada Popa: 00:25:36
Absolutely. Yeah. So, for whoever is interested, also, we have a paper called Visor. It's published at the top security conference [inaudible 00:25:43] security. So, I mean, the idea is that first you need to know what is the API that the GPU enclave is going to expose, and then on top of that API you can build your system. And in simulation mode you can still get a pretty good sense of the performance. It's not gonna run for real, but you're gonna get a pretty good sense of maybe how many cycles it's gonna take and some other metrics that you can then easily kind of extrapolate to the real hardware. Of course, there's some differences between the simulation architecture we used, Graviton, and the real NVIDIA enclaves, but it still prepares the thinking for the real hardware. So, it still gives us an advance, having had the chance to work in that simulated environment.
Jon Krohn: 00:26:30
Nice. Okay. Well I won't take us off on that tangent for too long, Raluca, but that was fascinating on the simulations. Thank you. So back to LLMs. So at my company Nebula, we have generative AI models, we have our own proprietary LLMs. And so I can see how these kinds of solutions you're describing, where our clients can be using LLMs without us, without my company, Nebula, being able to see any of their data, so they can feel confident in their employees sending data to us, which is important. So, we hear sometimes when we are prototyping a new generative AI capability, we'll use something like the OpenAI GPT-4 API to prototype it and make sure that it works well. But then in pitches, in client calls, prospective client calls, they say, you know, you're going to have to be using your own LLMs.
00:27:26
You won't be able to send our data off to that third-party API, to OpenAI. And it's because of things like people being burned, like the Samsung example that you gave, that actually comes up in our client calls frequently. And it's interesting because while the OpenAI license is quite permissive for ChatGPT, actually for the API, unless you specifically opt in, they don't store your data for more than 30 days. And even just that 30 days that they do store it for is for the purposes of monitoring for abuse. So, but you can see how people are like, oh no, that's a company we need to avoid. However, so going one step further, we already know some people want to avoid sending things to OpenAI specifically, but there are surely lots of people out there who want to avoid sending data even to me, like to our own LLMs.
00:28:24
So yeah, a lot of our episodes recently on the show have focused on various open-source architectures that you can download and then fine-tune. And we've had tons of episodes about techniques like low-rank adaptation, LoRA, these parameter-efficient fine-tuning techniques for being able to take pre-trained, really powerful open-source LLMs and fine-tune them to specific tasks. My data science team is regularly demoing things to me that I'm like, "you built this, we have this", it blows my mind what's possible today. And probably regular listeners might even be tired of hearing me say it. But anyway, this light bulb has gone off for me that, particularly for our enterprise clients, I'm sure there are some out there that if we said we're using Opaque Systems to take your data and we can't see it, it's just the bare metal enclaves that are gonna have access, none of your proprietary information is going to be even possibly seen by us, by hackers, I can see the huge advantage to that in pitches. And I don't know if I've left you with anything to say.
Raluca Ada Popa: 00:29:37
You got it very, very right. You're very right. Absolutely. So, I think a lot of the power of the LLMs will come when they use proprietary company data, not just the, you know, public web data. And exactly like you said, with Opaque and confidential computing, you can put the foundational model inside the enclaves and you can do your fine-tuning in there on very confidential customer data, and then you can still keep the model in there and answer prediction results that way.
Jon Krohn: 00:30:13
Yeah, that's even a step further than what I just said, because I was just thinking about inference time. So, I was just thinking, oh, at inference time with a confidential computing system, I would be able to say to my clients, you know, we won't get access to your data, but yes, I didn't, yeah, I could be fine-tuning on their data without seeing their data. That's wild.
Raluca Ada Popa: 00:30:30
Absolutely. You could be running any part of this in confidential computing. So, you could protect user queries by keeping them encrypted and doing the inference encrypted. But you can also do the fine-tuning in encrypted form. So, you can have the whole pipeline encrypted and then the customer really is assured that nothing about their data is revealed to you or OpenAI or anybody else. But again, there's flexibility if they just want to, if you just want to run some part of it, for example, you say, "Hey, I really want to see that model I'm fine-tuning because I, by seeing it, I can do lots of other cool, interesting tricks and whatnot." You can, and then you can keep the inference only in encrypted form and protect the user queries. Or you can have the model fine-tuning also be in an encrypted form and say, I want to protect the data that it's fine-tuning using, but then I'm gonna see the model or maybe not see the model. You see, you have the perfect flexibility here of what you want to keep protected and what you want to see and who can see it.
Jon Krohn: 00:31:26
Nice. Yeah, really exciting. You're opening my mind to lots of possibilities right now in real time. It's awesome.
Raluca Ada Popa: 00:31:32
We actually have, so we built a prototype like this in Opaque using the Vicuna open-source model. We actually developed Vicuna in our lab, in the SkyLab at UC Berkeley. It's a fine-tuned LLaMA, basically. Also, in our lab we're running this Chatbot Arena where we have users compare and contrast what are the best open-source models, and actually even compare to ChatGPT, closed-source. And so Vicuna, our open-source model, is the best performing of the open-source ones. So, that's quite exciting.
Jon Krohn: 00:32:10
Deploying machine learning models into production doesn’t need to require hours of engineering effort or complex home-grown solutions. In fact, data scientists may now not need engineering help at all! With Modelbit, you deploy ML models into production with one line of code. Simply call modelbit.deploy() in your notebook and Modelbit will deploy your model, with all its dependencies, to production in as little as 10 seconds. Models can then be called as a REST endpoint in your product, or from your warehouse as a SQL function. Very cool. Try it for free today at modelbit.com, that’s M-O-D-E-L-B-I-T.com
00:32:48
Yeah. Vicuna is a recurring topic on the show, and for listeners that aren't aware of it, episode number 670 was dedicated to LLaMA, and then episode 672 was dedicated to Vicuna, Alpaca, GPT4All-J, Dolly 2.0, these various open-source single-GPU LLMs that are available out there. And I guess while I'm on it, then episode 674 was about this parameter-efficient fine-tuning with LoRA that we were also talking about a few minutes ago, that allows you to take these open-source models and fine-tune them. So, Vicuna for example, they, Berkeley, you took these LLaMA weights that Meta didn't open-source but provided to researchers, including, I can imagine, Berkeley, in order to fine-tune. And it's really mind-blowing, the amount of data required to fine-tune so effectively to something like Vicuna that, you're saying, you know, gets this state-of-the-art performance on benchmarks like you have in the Arena that you mentioned.
00:34:07
It's wild how, like, just a few hundred or in most cases a few thousand examples. Yeah, like, it's the flexibility and capability of the LLMs out-of-the-box with just this little bit of fine-tuning on top. I think perplexity is like the best word for that, right, where you get these emergent capabilities on such a small amount of fine-tuning that is just stunning. Like typically we're seeing, you know, my data science team on typically like the second or maybe the third iteration on trying to be able to do some task, absolutely nail it.
Raluca Ada Popa: 00:34:48
Yeah, yeah, I agree. It's, it's really impressive. Also I do want to mention that Vicuna is developed by my colleagues in the same lab-
Jon Krohn: 00:34:56
I know, I know-
Raluca Ada Popa: 00:34:57
Here, I am doing the security report. I'm-
Jon Krohn: 00:34:59
When I said you, it was a very royal you it was-
Raluca Ada Popa: 00:35:02
Thank you. I appreciate it.
Jon Krohn: 00:35:03
Yeah, it's, it's very good of you to make sure that I'm not implying-
Raluca Ada Popa: 00:35:09
My, my colleagues Joseph Gonzalez, Ion Stoica and students are absolutely amazing doing the Vicuna, and I'm taking care of the security and confidentiality part. One thing that's very interesting there is that even when we have open-source models, a lot of users are not gonna want to, you know, develop their own infrastructure to run them and to do the inference. We know it's costly, we know it's expensive. Inference is actually more expensive than training because of the sheer volume of queries. So, even in that case, they actually might want to put these open-source models on some cloud, right? Or with some provider, right? To provide a service. So, yet again, we're back to the situation where some provider can see our queries, even though in principle we could run and host it ourselves as open-source. It just sometimes takes a lot of effort. That's another place where confidential computing can come in extremely handy. And for example, if you put this open-source model inside Opaque in the cloud, then you don't have to worry about creating the infrastructure for hosting the LLM. You don't have to worry about the fine-tuning on your sensitive data, because it's all protected and can run in the cloud. You don't have to set up this infrastructure on-prem. It's again about making this very easy to use and frictionless.
Jon Krohn: 00:36:27
Nice. Yeah, all really great points. And it's interesting, we probably don't say enough on air the point that you just made about how inference is typically more expensive. I mean, you hope that whatever platform you're building with these LLMs, or whatever feature you're building, is gonna take off. But if it does take off, then, you know, you think about these potentially huge costs of training an LLM, especially if you're training from scratch. If you're taking an open-source LLM and fine-tuning it, the PEFT can actually be extremely inexpensive. You could be talking about hundreds of dollars or thousands of dollars in compute to train the model, but then it's at inference time that it's gonna be crazy.
Raluca Ada Popa: 00:37:06
Absolutely. So, then you have to have a platform, right, that supports this. And then let's say you are a company that's, you know, mid-size or smaller, and your business is not about setting up infrastructure for your LLM, right? That's not your focus. That's not what you want to spend time on. That's not what you want to hire for or, you know, build infrastructure for. And the idea is you just want to kind of delegate it for ease of use. So, then again, you're gonna delegate it to some other provider or to a cloud. So, that's where, once again, you will need to think about the confidentiality of your queries, of the data you fine-tune using, of the resulting fine-tuned model, because, you know, that can also leak information. Many times neural networks remember too much, right, and so forth.
Jon Krohn: 00:37:48
Awesome. Well, so we've now talked a fair bit about open-source LLMs and how people might not want to share information with the big providers like OpenAI. But what if you do, what if you want the best cutting-edge API capabilities, the best generative AI capabilities? So, you want to use Midjourney, you want to use the OpenAI GPT-4 API. If people are insisting on doing that, then are there steps that they can take to help safeguard their privacy anyway?
Raluca Ada Popa: 00:38:23
Absolutely. Very, very good question. And really then the question is one of, what if a query contains PII information and you're sending a query with PII information to GPT-4? And I can completely understand why people might want to still use GPT-4. It's extremely good, right? So, one of the solutions we're working on at Opaque, we should have it in the next quarter, it's not yet ready, is this idea of stripping from the query the PII information. So, we're using, again, an LLM-like model to figure out what's the PII information, but it doesn't have to be as good as GPT-4 at answering the query, and replacing it with symbolic information so that GPT-4 can still reason in terms of that information and provide us a helpful answer. And then we plug back into that answer, replace the symbols with the information, and also at times we tag the query with relevant context.
Jon Krohn: 00:39:25
All right, talk me through an example, like give me an example so I can wrap my head around that.
Raluca Ada Popa: 00:39:32
Absolutely. So, let's say that I have the medical report of a patient that contains SSN, name, date of birth, and I want to upload this whole document, which contains a lot of medical information, to GPT-4. And then I just want to ask useful questions such as, when did this patient last go to the doctor? When is their upcoming appointment? What are, you know, the current pills that they have to take, and whatnot. Basically we need GPT-4 to synthesize this long medical record for us and be able to answer very, you know, very direct and simple questions. So, the idea then is that inside Opaque, inside enclaves, we're gonna ingest this medical report, again, Opaque is not gonna see it, and we're going to take out the PII, we're gonna take names out, we're gonna take social security numbers, dates of birth, very personal information out, and replace that with some symbols. Then we're gonna send-
Jon Krohn: 00:40:39
So, so you could literally, you could instruct the LLM you could say, you know, we're replacing sensitive information with like codes. The codes will look like this. You can just like-
Raluca Ada Popa: 00:40:52
Absolutely,
Jon Krohn: 00:40:53
Yeah. You know, don't worry about the specific content in that code.
Raluca Ada Popa: 00:40:56
Absolutely. So, let's say we call this patient X instead of, you know, I don't know, Alice. Alice is our usual victim in anything cryptography, Alice, Bob, and Malice. So, let's say we replace Alice with, you know, patient X, and we send this new medical record with a lot of these PII replaced with code names, and we instruct the LLM about these code names, and then the LLM is gonna answer, patient X last went to the doctor on this date, and then back in Opaque, we're gonna change that X to be Alice, right? So, the user sees an answer as if everything had gone directly to GPT-4, right? From the user's perspective, the experience is the same as if they're talking to GPT-4. But the difference is that GPT-4 doesn't get to see any PII data. They only see symbols, and Opaque is gonna handle all of that in enclaves using confidential computing. So, Opaque itself doesn't see that PII, nobody sees that PII.
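A toy sketch of this redact-and-restore flow might look like the following, with a couple of regexes standing in for the LLM-like PII detector Raluca mentions (all names, patterns, and the example record are invented for illustration). Opaque's real pipeline runs this inside enclaves; the point here is just the symbol mapping applied before a query goes to GPT-4 and inverted on the answer.

```python
import re

# Hypothetical PII detectors; a production system would use a learned model.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "NAME": re.compile(r"\bAlice Smith\b"),
}

def redact(text: str):
    """Replace PII with symbols, keeping a mapping so answers can be restored."""
    mapping = {}
    for label, pattern in PII_PATTERNS.items():
        for i, match in enumerate(pattern.findall(text)):
            symbol = f"[{label}_{i}]"
            mapping[symbol] = match
            text = text.replace(match, symbol)
    return text, mapping

def restore(answer: str, mapping: dict) -> str:
    """Substitute the original PII back into the model's answer."""
    for symbol, original in mapping.items():
        answer = answer.replace(symbol, original)
    return answer

record = "Patient Alice Smith (SSN 123-45-6789) last saw Dr. Lee on 2023-05-02."
redacted, mapping = redact(record)
# Only `redacted` would be sent to GPT-4; its answer is restored before the
# user sees it, so the external model never observes the real PII.
print(redacted)
print(restore("[NAME_0] last visited on 2023-05-02.", mapping))
```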
Jon Krohn: 00:42:02
That makes so much sense to me. It's interesting that when you first explained it, I thought it was gonna be like really hard to wrap my head around, but it's crystal clear. That example is so obvious like it's I can see exactly how that would work and I can imagine that it would work flawlessly all the time. So just making sure that it's something that happens smoothly and easily is the kind of thing that Opaque can provide.
Raluca Ada Popa: 00:42:24
Absolutely. And you know, it really comes with advantages for compliance. For example, HIPAA compliance. It's very clear about what you need to ensure and by switching to symbols you can ensure compliance with regulations like HIPAA.
Jon Krohn: 00:42:38
Cool. All right, well that was a fascinating number of topics for Raluca on generative AI and the privacy risks associated with using them, how we can overcome them. Let's switch gears a bit away from your commercial work at Opaque Systems to your academic work. So, you're a professor at Berkeley and there you co-founded and co-direct the RISELab as well as SkyLab. So, tell us about these research labs and the problems that you're tackling.
Raluca Ada Popa: 00:43:10
Absolutely. So, I'll start by saying how the Berkeley labs are structured, because I think it's a very unique model. So basically every five to six years, some of us faculty group around what we think are the big problems to solve in, you know, computing systems. And that gets a lab and a lab name. And there's a lot of industry companies that are supporting us and collaborating with us, giving us feedback about what are the major problems they're seeing, and then we incorporate that in our research. So, the RISELab, which started five years ago, has just come to an end. It was really focused a lot on cloud computing, security in the cloud, and machine learning in the cloud, but in one cloud. So, outcomes from that were prominent open-source projects like MC2 for us, which is confidential computing and is the basis of Opaque.
00:44:07
Also Ray, which you guys might have heard of, and is the basis of Anyscale. So, now we regrouped again to create a new lab, the SkyLab. And really the major problem we kept hearing a lot from industry and around is the fact that the world is becoming cross-cloud. So, not just one cloud as we were in the RISELab, but cross-cloud. So, going from the cloud to the sky, that's why it's called the SkyLab. So these, you know, companies have pipelines, they don't want to be locked into any one cloud, and there's a lot of cross-cloud applications these days. Lots of companies are cross-cloud, but also because different clouds have different strengths. For example, you know, Google has TPUs, right? Or Azure has excellent confidential computing. So, different parts of your pipeline might make sense across different clouds.
00:44:54
So, what we try to do in the SkyLab is we're trying to enable that sky computation so that users and organizations can just run their pipelines in the sky without having to develop complex systems for which part goes on each cloud, and trying to decide which is the most cost-effective or the most performance-effective way to optimize it. We do all of that. We basically schedule, you know, each component wherever it should go. We optimize the cost, we optimize the performance, and we take care of the security cross-cloud. So, that's basically the big research thrust and the work we're doing right now in the SkyLab at UC Berkeley. And obviously a big part of that is also, you know, generative AI, supporting that in a very efficient way in the cloud.
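The cross-cloud placement idea can be caricatured in a few lines: for each stage of a pipeline, pick the cheapest cloud that offers the capability the stage needs. The cloud names, capabilities, and prices below are made up purely to illustrate the kind of optimization a sky scheduler automates; the real SkyLab systems handle data movement, security, and far richer objectives.

```python
# Hypothetical clouds, capabilities, and hourly prices (not real quotes).
CLOUDS = {
    "cloud_a": {"capabilities": {"tpu"},                "price_per_hour": 9.0},
    "cloud_b": {"capabilities": {"confidential_gpu"},   "price_per_hour": 12.0},
    "cloud_c": {"capabilities": {"cpu", "confidential_cpu"}, "price_per_hour": 3.0},
}

# A toy three-stage pipeline with per-stage requirements.
PIPELINE = [
    {"stage": "pretraining",           "needs": "tpu",              "hours": 40},
    {"stage": "confidential_finetune", "needs": "confidential_gpu", "hours": 6},
    {"stage": "etl",                   "needs": "cpu",              "hours": 10},
]

def place(pipeline, clouds):
    """Assign each stage to the cheapest cloud that satisfies its requirement."""
    plan, total = [], 0.0
    for step in pipeline:
        eligible = [(spec["price_per_hour"], name)
                    for name, spec in clouds.items()
                    if step["needs"] in spec["capabilities"]]
        price, choice = min(eligible)
        plan.append((step["stage"], choice))
        total += price * step["hours"]
    return plan, total

print(place(PIPELINE, CLOUDS))
```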
Jon Krohn: 00:45:40
That is fascinating. So, first of all, I didn't know about that Berkeley lab structure, but it doesn't surprise me, because I've noticed, when I am researching guests at Berkeley, there's so much exciting research that even before I was a podcast host, I was frequently looking at things that were going on in Berkeley AI labs in particular. And it seemed to me obvious that there was a huge amount of collaboration. Like you see-
Raluca Ada Popa: 00:46:03
Absolutely!
Jon Krohn: 00:46:04
Different co-directors, prominent AI people working together. I didn't know that they had these kinds of I don't know if you would describe it as like a mission to like kind of accomplish over a certain number of years. Yeah?
Raluca Ada Popa: 00:46:16
Yeah. It's a mission. We are all united in the lab for a mission, and we have, you know, for example, the SkyLab has professors in machine learning and systems and in security, because to accomplish a mission, you need to cover all of these aspects. Yeah, I think, I think the lab structure is incredibly exciting, the way Berkeley managed to make it work. It has a huge track record. If you think about it, from our labs, Spark came out, right, from the AMPLab, that was two labs ago. Or we have RISC-V or RAID or all these big open-source projects. Again, the labs are about creating open-source that has an impact and, and changes the world. Honestly, that was kind of the sexiest thing that attracted me to Berkeley when, you know, I finished my PhD and I had faculty offers, MIT, Stanford, Berkeley, and, you know, a bunch of others. I just love the lab model so very much; it felt like the right synergy of industry collaboration and research really coming together.
Jon Krohn: 00:47:14
Yeah, fantastic. Seems like a great choice. I've never been to the campus, but it seems like a beautiful place, a very collegial environment.
Raluca Ada Popa: 00:47:22
You should come visit that.
Jon Krohn: 00:47:24
Nice. Yeah. And long overdue. Okay, so SkyLab, it's focused on efficient, seamless compute pipelines across multiple clouds, taking them into the sky. I liked that clear example of, like, maybe you'll want to use Azure for confidential computing, GCP for their TPUs. And so these different advantages of the different clouds, you can leverage all of those as an end user as you are developing your compute pipelines. So, that makes a huge amount of sense to me.
Raluca Ada Popa: 00:47:54
As well as smaller clouds. So, the nice thing about it is that there's so much, there's a lot of innovation in the smaller clouds, but they tend to be specialized at one thing, right? Or a very few things. They're not like Amazon AWS, which is so general, has so many offerings. It tends to be kind of like a one-stop shop. And this really encourages also the innovation in the smaller clouds and the smaller deployments.
Jon Krohn: 00:48:17
Cool. Yeah, I hadn't even thought of that. I don't know. Yeah, it's, it's so bad that the big players are so prominent that I spend an embarrassingly small amount of time thinking about the smaller players in the space.
Raluca Ada Popa: 00:48:29
Exactly.
Jon Krohn: 00:48:30
Like in my mind it's like what's a small player? Like the Alibaba cloud is small.
Raluca Ada Popa: 00:48:36
Yes, yes, yes. IBM is smaller as well and-
Jon Krohn: 00:48:40
I guess so, yeah, but even those are huge. I'm sure for specialized use cases, there must be so many. There's probably hundreds or thousands of different cloud providers. Before we kinda transitioned over here, you mentioned a really fascinating specific topic. What was that?
Raluca Ada Popa: 00:48:56
I would say that, you know, some of the smaller clouds are also these mining pools, like this big, you know, this [crosstalk 00:49:03], all this spare compute used to be specialized just for blockchain, but, you know, now there's thoughts of repurposing them for, you know, LLMs, and that's also really exciting. Yeah, actually, speaking of small clouds, there are some also very interesting compute resources that are available in the blockchain world. As you know, there used to be this proof-of-work where you mine and expend a lot of compute to, you know, win some reward. But now a lot of the big blockchains, for example Ethereum, they've moved to something called proof-of-stake. We don't need to do that compute anymore. It's all based on something completely different, based on how much stake you have in the system and things like this.
00:49:46
So now there's all that compute available in these mining pools, and you can think of them as little clouds, and all that compute is set up to, you know, work for some reward and for some goal. And then a very interesting research question we're looking at is how can you repurpose those for training LLMs or for inference of LLMs, right? Those nodes are extremely good at taking a task and getting a reward for what they're doing. And in fact, in the blockchain space, there's also this very interesting structure called slashing, where they put some money in themselves, and if you find that they're doing something incorrect, you basically get that money, kind of like an insurance. So, it's really nice.
00:50:29
Now, you know, the challenge there, that my students and I are thinking a lot about, is that because it's not a major cloud provider running this computation, but lots of small operators, how can you trust, right? How can you trust that they're doing this training correctly, training a model correctly, for example? And because you have a huge computation, you give them different pieces. How can you trust that? And yeah, for example, you could say, let's say multiple nodes could run the same computation, but then again the cost will grow, right? Now there's actually some estimates that if you have the compute run by these nodes, it's actually gonna be cheaper than the clouds, because clouds charge premiums and GPUs are expensive in the cloud. You're actually gonna get it much more efficiently, much more cheaply, with these nodes. But the problem with these nodes is a bit the trust, right? They're all different operators from all kinds of different places; how do you know that they're doing the right thing? So, that's a very big research challenge, one that we're thinking about in the SkyLab.
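One simple way to picture the trust problem and the replication idea Raluca raises: give the same shard of work to several independent nodes and only accept a result that a quorum of them agree on, flagging the rest for slashing. This is a toy majority-vote sketch, not how SkyLab actually solves it; verifying distributed LLM training efficiently remains the open research question she describes, and all node names and results below are invented.

```python
import hashlib
from collections import Counter

def fingerprint(result: bytes) -> str:
    """Hash a node's returned result so identical outputs can be compared cheaply."""
    return hashlib.sha256(result).hexdigest()

def verify_by_replication(results_by_node: dict, quorum: int = 2):
    """Accept the majority result only if at least `quorum` nodes agree;
    nodes that disagree with the majority are candidates for slashing."""
    counts = Counter(fingerprint(r) for r in results_by_node.values())
    digest, votes = counts.most_common(1)[0]
    if votes < quorum:
        raise RuntimeError("No quorum: task must be re-run or nodes slashed")
    honest = [n for n, r in results_by_node.items() if fingerprint(r) == digest]
    cheaters = [n for n in results_by_node if n not in honest]
    return honest, cheaters

honest, cheaters = verify_by_replication({
    "node_1": b"gradient-update-v1",
    "node_2": b"gradient-update-v1",
    "node_3": b"tampered-output",
})
print(honest, cheaters)  # ['node_1', 'node_2'] ['node_3']
```

The obvious trade-off, as Raluca notes, is that replication multiplies the compute cost, which eats into the price advantage these nodes have over cloud GPUs.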
Jon Krohn: 00:51:35
That sounds like a really useful thing to be doing. If there's all of these cycles and devices out there not being used, they could be leveraged for, as we've already talked about earlier in this episode, the very intensive compute associated not just with training LLMs, but also with inference. And so hopefully that can bring down the cost of running all these models, and, you know, thousands of startups have sprung up just in the last few months to take advantage of this and deploy it across industries. But in order for that to disseminate across all of the world and across all industries, as lots of people anticipate over the coming years, the coming decade, yeah, we're gonna need to be able to compute cheaply. So, this sounds like a really useful aim, a really useful mission of SkyLab.
Raluca Ada Popa: 00:52:26
Thank you.
Jon Krohn: 00:52:27
Let's talk about your feelings on closed-source versus open-source more generally. So some of the providers of these big LLMs are increasingly closed-source. One of the most cutting digs at OpenAI is to call them ClosedAI. You know, they were set up with this mission of open-sourcing AGI, and now with GPT-4, their technical paper doesn't provide any details about the model, whereas at least the GPT-3 paper a couple of years ago did. So, what are the trade-offs here? What's your perspective on it? Are there some things that tech companies should be keeping closed-source to remain competitive? Or does this stifle innovation? Are there security implications? I imagine there are a lot of ways you could go on this, a lot of thoughts that you have.
Raluca Ada Popa: 00:53:36
Absolutely. Very, very good question. It's something that we've been debating a lot in our own lab. We recently had a retreat with companies and we spent a whole panel just arguing about open-source versus closed-source, and my colleagues are pretty divided, I would say. I think I'm somewhere in the middle, in the sense that I think proprietary LLMs are still going to be extremely, extremely good, just because it takes a lot of money and resources to do all the data collection, to have access to the data to train an LLM. I mean, why isn't Google Search open-source, for example, right? So I think the proprietary ones are still going to have some advantage, but I also think that open-source LLMs, especially foundational models, are going to become very, very good.
00:54:36
Also, there's some thought on whether the government is gonna step in and want to help with open-sourcing quality LLMs, just because there's more control then, right? And there's just more transparency about what's going on. It comes back to trust and confidence, because we all know that these LLMs can have a significant impact on society if they're not, you know, handled with care. The other aspect I think is interesting is that there's not necessarily gonna be one LLM provider to cover them all. There are so many domains of knowledge and expertise; for example, there could be a really good one in tech, or one in a certain kind of insurance, right? So, I think there are gonna be companies that specialize in a domain, take the foundational LLM, probably an open-source one, and fine-tune it for that domain of expertise. And they'll have access to confidential data for that domain, and that will make their LLMs extremely good. So I think there's a place for proprietary LLMs in very domain-specific areas, and I think they'll be better than a general LLM provider at that specific task. They won't cover all the other tasks, but they'll be better at that specific one. So, I think there will be some sort of decentralization of these LLM providers based on tasks and based on domain.
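As a rough sketch of the domain-specialization pattern Raluca describes, here is what parameter-efficient fine-tuning of an open-source foundation model on in-house data might look like with the Hugging Face transformers and peft libraries. The model name, data file, and LoRA target modules are placeholders, and this is an illustration of the general approach, not a recipe from the episode.

```python
# pip install transformers peft datasets
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

# Hypothetical open-source base model and in-house domain corpus.
base = "open-foundation-7b"  # placeholder model name
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token

# LoRA keeps the fine-tune cheap: only small adapter matrices are trained.
lora = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "v_proj"])  # module names vary by model
model = get_peft_model(model, lora)

# Placeholder file of confidential, domain-specific text (e.g. insurance claims).
data = load_dataset("json", data_files="insurance_claims.jsonl")["train"]
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
data = data.map(tokenize, batched=True, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-llm",
                           per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The point of the sketch is that the heavy lifting (the foundation model) stays open-source, while the proprietary value lives in the domain data and the small fine-tuned adapter.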
Jon Krohn: 00:56:09
Yeah, I'm a hundred percent in the same boat as you, Raluca. I see the same thing happening. There are too many groups out there, like the one at Berkeley working on Vicuna and open-sourcing that, too many clever people all over the world coming up with ways of doing this, especially, as you say, for these domain-specific use cases. You know, smaller models with 3 billion or 7 billion parameters can run on a single GPU, or be quantized and run on a CPU, very cheaply, very quickly, and tackle these specific tasks. And provided that the creator has some proprietary data from their niche that they've collected from lots of users, they can create a model that's better than the big proprietary ones from Anthropic, OpenAI, whatever.
Raluca Ada Popa: 00:57:08
Yeah. And I mean, speaking of open-source versus proprietary from the perspective of security and privacy, there is a clear advantage of open-source there, right? You can take an open-source model and run it on-prem, right? And you can send queries containing proprietary information to that model, no problem, because it runs on-prem. It's not like you're sending those queries to a remote LLM provider that can now see your organization's data, code from your organization. So, there's a clear, clear advantage of open-source models over proprietary ones when it comes to privacy and confidentiality.
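To make the on-prem point concrete, here is a minimal sketch, not from the episode, of querying a locally hosted, quantized open-source model with the llama-cpp-python bindings. The model file path is a placeholder; the important property is that prompts containing confidential material never leave your own machine.

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Hypothetical path to a locally downloaded, quantized open-source model.
# Because inference runs in-process on your own hardware, prompts containing
# proprietary code or internal documents never leave your infrastructure.
llm = Llama(model_path="./models/open-llm-7b.Q4_K_M.gguf", n_ctx=2048)

response = llm(
    "Summarize this internal design document:\n<confidential text here>",
    max_tokens=256,
    temperature=0.2,
)
print(response["choices"][0]["text"])
```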
00:57:46
There is, though, I think another challenge when it comes to privacy and confidentiality, and this is something hard to address even when you have an open-source model that runs on-prem. So, let's say this open-source model is fine-tuned over your organization's data, right? But then what if Alice asks a question like, "What's the salary of the CEO?" You know? If this LLM is trained over the whole organization's data, the thing is that even within an organization there are pockets of knowledge, right? There are access controls, walls of information flow, and how can we teach an LLM to respect those walls? That's really challenging. I mean, you can tell it, "Hey, don't provide financial information to Alice," but you probably can't exhaust all the possibilities that indirect queries could reveal. So, that's, I would say, a very challenging question with open-source or proprietary models: how do they keep track of these information walls?
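As a toy illustration of the information-walls problem, and emphatically not a real solution or anything from the episode, here is a sketch of a rule-based gate that checks a user's role before a query reaches the model. As Raluca points out, such allow/deny lists can't anticipate indirect queries, which is why this remains an open research question. All names and policies below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class User:
    name: str
    roles: set

# Hypothetical topic -> allowed-roles policy; real information-flow rules
# are far richer and, per the discussion, hard to enumerate exhaustively.
POLICY = {
    "compensation": {"hr", "finance"},
    "source_code": {"engineering"},
}

def topic_of(query: str):
    """Naive keyword classifier standing in for a real topic detector."""
    q = query.lower()
    if "salary" in q or "compensation" in q:
        return "compensation"
    if "source code" in q:
        return "source_code"
    return None

def guarded_ask(user: User, query: str, ask_llm) -> str:
    """Refuse queries whose topic falls outside the user's roles."""
    topic = topic_of(query)
    if topic and not (POLICY[topic] & user.roles):
        return "Refused: this topic is outside your access rights."
    return ask_llm(query)

# Hypothetical usage with a stand-in model call.
alice = User("Alice", roles={"engineering"})
print(guarded_ask(alice, "What is the salary of the CEO?", ask_llm=lambda q: "..."))
```

The sketch fails exactly where Raluca says such approaches fail: a rephrased or indirect question slips past the keyword check, because the knowledge itself is spread across the model's weights rather than behind an access-controlled boundary.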
Jon Krohn: 00:58:50
Yeah. A year ago there was a really trendy term for describing solutions to this problem. Prior to LLMs kicking off, prior to ChatGPT's release, there was a lot of talk about this idea of a data mesh. And so we had Zhamak Dehghani in episode number 609 of this show; she coined the term data mesh and wrote the O'Reilly book on the topic. It was a fascinating technical deep dive into these kinds of problems: when you have an organization, the HR people need to know the CEO's salary, they need to be able to pay the CEO's paycheck every month, but of course that isn't necessarily information that everyone in the company should have. And so it's interesting, because even with that data mesh concept, there was this idea of having data distributed across different departments, and you have rules as to who can get access to what. But with LLMs, as you're alluding to, this is really hard, because there are no longer these strict boundaries. The knowledge is just distributed across the model weights.
Raluca Ada Popa: 01:00:09
Exactly. Exactly. The weights combine, and they're affected by really all the information. And we know in security and cryptography that sometimes you can reveal information indirectly. For example, if I know that Alice leaves for work every day at, you know, 8:00 AM, then I can infer that, okay, if she's leaving her home, then she's going to work, right? Or something like this. There's indirect knowledge that you can infer, yeah, even without asking a question directly. So, how do you prevent that?
Jon Krohn: 01:00:50
Yeah. So, at this point, I guess that's just an open question, there's no, like-
Raluca Ada Popa: 01:00:55
Exactly. It's a research question. Of course, there are some attempts; NVIDIA's NeMo Guardrails has some interesting proposals. There are some attempts at teaching the LLM to, you know, respect some rules, but it's not going to be precise. There are going to be ways to get around that.
Jon Krohn: 01:01:14
Yeah. Interesting problem. I'm sure we'll see lots of attempts to solve that coming out over the years, so that we don't have to have like a department-level LLM across the organization, with all the engineering complexity of having all of those running. It'd be nice to be able to have one big LLM that thoughtfully respects privacy in a way that we can trust. So, that'll be really interesting. Cool.
Raluca Ada Popa: 01:01:39
That would be very nice.
Jon Krohn: 01:01:42
Really great points across the board on open-source there, Raluca. Quickly: other than Opaque Systems, you have another startup, PreVeil.
Raluca Ada Popa: 01:01:53
Yeah.
Jon Krohn: 01:01:53
And so this was a cybersecurity, or this is a cybersecurity company born out of research that you did back when you were at MIT, before you were at Berkeley. And so tell us a bit more about that. I know it's to do with end-to-end encryption.
Raluca Ada Popa: 01:02:08
Absolutely. Absolutely. So, PreVeil is a company now; I think we can consider it graduated from startup status. It's almost eight years old, it's in Boston, and we have 800 enterprise customers, so it's doing very well. We basically provide encryption for files. Think of Box, but with end-to-end encryption, so not even Box can see what your data is. Think of email and chat all packaged in there too. And really the main attraction is the fact that we satisfy some extremely stringent compliance requirements, for example from the defense sector, [inaudible 01:02:52]; we make all of this compliance so much easier. We are integrated with enterprise modules of various sorts. For example, do you use ProtonMail, or Signal, or Telegram?
Jon Krohn: 01:03:08
I don't personally, but I'm aware of all of them. And I'm aware that, you know, there's, there's certainly people who are like, you've got to be using them and you can't be using anything else.
Raluca Ada Popa: 01:03:16
Yes, exactly. So, those are for individual users and for chat; think of PreVeil as the same thing for enterprises, and not just chat, but file sharing as well, like Box or Dropbox, and email. And think of, you know, stringent compliance. So, yeah, PreVeil is doing really well. We're very excited. We were voted by PC Magazine as the number one email encryption for enterprises. So, that's very exciting.
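To illustrate the end-to-end idea in general terms, and this is not PreVeil's actual implementation, here is a sketch using PyNaCl sealed boxes: the sender encrypts on their own device to the recipient's public key, so the storage or mail server in the middle only ever handles ciphertext.

```python
# pip install pynacl
from nacl.public import PrivateKey, SealedBox

# Each user generates a keypair on their own device; private keys never
# leave the client, which is what makes the scheme end-to-end.
recipient_key = PrivateKey.generate()
recipient_public = recipient_key.public_key

# The sender encrypts locally to the recipient's public key.
ciphertext = SealedBox(recipient_public).encrypt(b"Q3 board deck - confidential")

# The file-sync or email server only ever stores this opaque blob.
server_stores = ciphertext

# Only the recipient, holding the private key, can decrypt.
plaintext = SealedBox(recipient_key).decrypt(server_stores)
print(plaintext.decode())
```

A production system also has to handle key distribution, sharing with groups, and recovery, which is where products in this space differ; the sketch only captures the core property that the server never sees plaintext.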
Jon Krohn: 01:03:43
Cool. So, yeah, serial entrepreneur now, with two successful cybersecurity startups, or, well, companies, under your belt.
Raluca Ada Popa: 01:03:52
The second one is a startup. Opaque is, I would say, two and a half years in, so it probably still has the startup title.
Jon Krohn: 01:04:00
And yeah, so, what motivated you to be doing entrepreneurship while also maintaining a significant academic career?
Raluca Ada Popa: 01:04:09
Absolutely. I think that for me, as a professor, I just want to impact the world and improve it through, you know, better technology. And I think writing papers and solving big, important problems that you then publish is the first step, but it's only the first step, because in order for you to actually change the world, you have to go and do it for real. Papers that sit on a shelf rarely get picked up by companies. They do get picked up sometimes, but much, much more rarely. What I believe strongly in is also doing it: not just saying how, but actually going and doing it, making it frictionless to use, and really demonstrating its full capabilities. So, I really, really enjoy this whole tech transfer.
01:04:58
And I think a lot of my colleagues at Berkeley are quite like-minded, and Berkeley is just such an amazing university that actually encourages and supports us in doing this tech transfer, in having both an industry hat and an academic hat. It's actually funny, there's this joke going around: you know how the top computer science programs, Berkeley, MIT, and Stanford, are all tied at number one, but people say that MIT is too far from Silicon Valley, Stanford is too close, and Berkeley is just right. And I think that's true. I think that right distance from Silicon Valley really enables us to make our research real, to inform our research from what companies really need and the problems they face, and then also get that adoption through tech transfer.
Jon Krohn: 01:05:52
Yeah. A colleague of yours, whom I know you know very well, Pieter Abbeel, who was in episode number 503 of this show, said an interesting thing about the difference between academia and industry. He's in robotics, he hosts the Robot Brains podcast, and you've been a guest on that show. So this is a robotics-specific example, but I suspect, and you can maybe elucidate on this a little bit more, that it is probably similar in your space as well: in academia, Pieter was concerned with being able to show off completely new capabilities. I think the example he gave in the episode was the robot hand that was solving a Rubik's cube. That's a really cool thing to be able to show academically; you're clearly demonstrating that you're really pushing the boat out in what robots can do. But that's not an industry application.
Raluca Ada Popa: 01:06:49
Right, right, right.
Jon Krohn: 01:06:50
For industry, you're not interested in being able to get a video once of a Rubik's cube being solved. You're interested in extremely low error rates. The robots don't necessarily need to be doing the craziest things you can imagine, but what they do do, they need to do with like 99.99999% accuracy. So there are these different challenges between getting an academic paper published and making a big splash with that paper, relative to building a successful product in industry.
Raluca Ada Popa: 01:07:25
Absolutely, yes. I completely agree with that point. You develop the research, you develop the technology, and then you say, oh, let me do tech transfer. And you think, okay, the very first part of doing tech transfer is obviously building a very robust product. Absolutely robust, right? It's not gonna fail for the user, and it's super tight when it comes to security. That's first and foremost. But it's actually not enough. Then you realize there's this whole other big slice of the pie, which is about, okay, why would the user use this as a priority? Why would the business buy this product as a priority over all the other things they need to do? Why is this a huge pain point? And one thing that I realized in security is that if you tell them, "Hey, you're gonna have so much more security, you're gonna be so much less exposed to hackers," they're gonna say, "That's really wonderful, and it's something I would like to have, but it's not my number one priority right now. I need to figure out how to be compliant, or I have to figure out how to even acquire this data set that I need."
01:08:36
So, that's another part of it: if you want to do this technology for real and you want it to be adopted, it's not enough to do the research. It's not even enough to build an amazingly good system. You also have to demonstrate to people how it beats their other priorities. For example, one thing I found that helps with security: it's not so much convincing them about stronger security, because they believe it, but it's not a priority. It's the fact that, hey, you can actually be compliant more easily. You have to be compliant anyway, and it's gonna be cheaper and faster to be compliant because this technology is just better at security, so it's easier to be compliant with it. Or, "Hey, you want to get this other data set that they cannot share with you because it's confidential? That's what confidential computing and Opaque enable you to do. Now you can acquire the data set, now you can collaborate." So, it's about enabling this functionality. It's not so much about just strengthening security.
Jon Krohn: 01:09:29
Yeah. Really great case study there, and it shows the impact that you can make, the value that you can drive, as an entrepreneur beyond just having your academic career. It's something that, candidly, looking back, there's a part of me that's like, wow, I wish I had stayed in academia after my PhD. When there are people like you or Pieter doing the amazing things that you're doing at Berkeley, to have that kind of community, to be able to go in and tackle these problems together instead of largely on my own as an entrepreneur, the appeal is huge. So, I'm-
Raluca Ada Popa: 01:10:10
It's a lot of fun.
Jon Krohn: 01:10:11
Super jealous.
Raluca Ada Popa: 01:10:12
It's a lot of fun. It's a lot of fun. Probably the one thing you won't be jealous of is the lack of sleep involved in making all of this happen. But it's so much fun. In fact, it's just so stimulating and exciting to have both the academic and industry part that, what's the point of sleep?
Jon Krohn: 01:10:36
Very good. Yeah. So, in addition to making an impact as an entrepreneur, you also make a big impact promoting diversity and equity in the tech industry through initiatives such as the DARE program, D-A-R-E. Do you want to tell us more about why this is important to you?
Raluca Ada Popa: 01:10:58
Absolutely. Yeah. So, DARE stands for Diversifying Access to Research and Engineering. The idea was that I was trying to debug and understand why there aren't, you know, enough minorities in computer science research, and I was really trying to get to the bottom of it. So, I was talking to a bunch of minority groups at Berkeley, and they were telling me, oh, it's so daunting to get started on research with a professor. For example, in the security class that I teach every year there are 700 students, right? And I'm only one; I have TAs, but I'm one professor. So, if they want to do research with a professor, they have to kind of compete with 700 students. What was shocking was that some minority students were incredibly strong, they had exceptional performance in my class and in other classes, and they weren't even attempting to contact a professor because they thought, oh, there's no way I could get this.
01:11:55
And so it was a confidence issue, and they would just, you know, go for an industry internship when they could have also had that exposure to research and all that it teaches you. So, what we created is a system where the professors and the grad students contact the students, which eliminates this barrier. And how do we do that? Well, we have the students' data on, you know, the classes they took, and they upload their CV to the system. They don't have to be shy about talking to professors, because they're just uploading their information to our system. And we have a clever matching algorithm: we know from each professor what kind of student they're looking for, what profile, and we match them, and then we show these professors some extremely strong students.
01:12:42
And the professors and their grad students many times are like, wow, I really want to work with this student and this one and this one. And they send an email to the student: "Hey, do you want to chat? Do you want to connect?" It removes that barrier for the student to approach the professors and approach the research; the professors come to them. And we've seen a lot of success since the program started. We've had more than a hundred students do research, and I know at least roughly half of them are minorities, either from a low socioeconomic background, or ethnic minorities, or, you know, gender minorities. So that really helped. And it's nice because it equalizes things: it's no longer based on who's intimidated or not, it's really based on, you know, whether you're a fit for this research, and also on professors choosing to have a diverse and healthy research group.
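The episode doesn't describe DARE's matching algorithm in any detail, so purely as an illustrative sketch under my own assumptions, here is the kind of interest-overlap scoring such a system might use; all names and fields are hypothetical, and the real system presumably uses much richer signals from coursework and CVs.

```python
def match_students(professor_interests: set, students: list, top_k: int = 3) -> list:
    """Rank students by overlap with a professor's stated research interests.

    A toy stand-in for a matching step: score each student by shared
    interests and surface the strongest candidates to the professor.
    """
    def score(student):
        return len(professor_interests & student["interests"])
    ranked = sorted(students, key=score, reverse=True)
    return [s["name"] for s in ranked[:top_k] if score(s) > 0]

# Hypothetical data.
students = [
    {"name": "Student A", "interests": {"security", "cryptography"}},
    {"name": "Student B", "interests": {"robotics"}},
    {"name": "Student C", "interests": {"security", "systems"}},
]
print(match_students({"security", "cryptography"}, students))
# -> ['Student A', 'Student C']
```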
Jon Krohn: 01:13:33
Fantastic. Yeah. And it's great that you were able to identify this specific contributing cause of the lack-of-diversity issue and resolve it. That's another example of not just doing the research on something, but also making that real-world impact.
Raluca Ada Popa: 01:13:54
Thank you. And an exciting, actually fun, story here, one that I'm very happy about: one of my PhD students, actually [inaudible 01:14:04], my first PhD student, is now a professor of computer science, of computer security, at Carnegie Mellon. And she is right now implementing the same program there. So, my students are taking the program and implementing it at a few other universities. That's very exciting.
Jon Krohn: 01:14:23
It's so good. I mean, there are so many reasons why this is so essential. Obviously everybody who merits it should be getting a look, should be getting their chance, and should be able to have the same kinds of amazing life experiences as everyone else. But then also, selfishly, from a world perspective: with all of these open-source tools that are available to us, all of these inexpensive or free educational resources that are out there, there's never been a better time to be alive. All you need is an internet connection and you can be thinking of product ideas, you can be creating companies, you can be making a really big social impact. But that impact is only maximized if everyone feels like they can do it.
Raluca Ada Popa: 01:15:15
Absolutely.
Jon Krohn: 01:15:16
And if there's only a subset of the world's population that feels like they could do it, then obviously the total social impact is smaller.
Raluca Ada Popa: 01:15:24
Absolutely. Absolutely.
Jon Krohn: 01:15:26
So, yeah, cool that it's propagating and it, it'll probably keep going and keep going.
Raluca Ada Popa: 01:15:30
It's gonna be interesting to think about how to expand this to areas outside of computer science, where research happens differently. And I personally don't really have a clue what makes sense there. This was very much custom-designed for engineering, for computer science, really.
Jon Krohn: 01:15:49
Yeah. But some of these smart people are gonna figure it out, for sure. Well, thank you Raluca for taking so much time with us. You've been so generous with your time, especially given the complete lack of sleep that you have with your entrepreneurialism and your cutting-edge academic work. Before I let you go, do you have a book recommendation for us?
Raluca Ada Popa: 01:16:13
I guess you'll really see what a nerd I am in answering this question. If you want to keep track of the biggest innovations in security and privacy for AI, analytics, and machine learning in general, I would really recommend the conference proceedings of, actually, the IEEE Symposium on Security and Privacy. The advantage is that it's a book that keeps giving: every year you have new, cutting-edge research in the space that you can keep track of. Yeah. So, a very nerdy answer, but that's what I think.
Jon Krohn: 01:17:01
In the three years or so that I've been host of this show, I think it's the first time we've had conference proceedings recommended as the book, but I love it. It's really well suited to you, and it also makes a lot of sense. You're right: annually you get a whole new book, nicely updated for you. That's awesome.
Raluca Ada Popa: 01:17:19
Because I also think that any person you take, their views of this world are pretty skewed and there's a lot they're missing. They have some pretty good insight, but there's also a lot that they're missing, even the smartest people in the world. Whereas conference proceedings are kind of the voice of many, so I tend to believe they're a more reliable source of information. Yeah, I know, a very nerdy answer, but yeah.
Jon Krohn: 01:17:48
All right. And Raluca, one very last thing. Obviously, you are a brilliant communicator. Your answers to all of my questions today were so crisp; they were condensed perfectly but also had great, clear examples that made them easy to understand. So, I'm sure there are a lot of listeners who would love to follow your thoughts after this episode. How can they best do that?
Raluca Ada Popa: 01:18:17
Thank you. I would say the best is probably to follow me on Twitter; searching my name, Raluca Ada Popa, is probably the easiest. I tend to announce some of the bigger initiatives or work there, from both academia and industry. I also have a website with my publications, but I think the Twitter covers both research and industry endeavors.
Jon Krohn: 01:18:41
All right, thank you so much Raluca. This has been an awesome episode and maybe in a couple of years we can catch up again and see how things are coming along.
Raluca Ada Popa: 01:18:50
Sounds great, thank you so much.
Jon Krohn: 01:18:57
Well, that was a slam dunk. In today's episode, Raluca filled us in on how hardware enclaves on CPUs, and now recently GPUs, allow data to be used for analytics, inference, or model training without anyone, including the model owner, being able to see the unencrypted data. She talked about how you can even call a commercial generative AI provider's APIs, such as those provided by OpenAI, or train an open-source LLM without, again, anyone accessing data you'd like to keep private. She talked about how her SkyLab is enabling efficient, seamless compute pipelines across multiple clouds, such as Azure for confidential computing, GCP for their TPUs, and smaller cloud providers for their specialized capabilities. And she talked about how open-source LLMs, especially domain-specific ones, will likely be competitive with commercial LLMs on specific tasks, and how open-source AI development may lead to better security.
01:19:50
As always, you can get all the show notes including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Raluca's social media profiles, as well as my own social media profiles at superdatascience.com/701. That's superdatascience.com/701. All right. And I always assumed that like me, people mostly listen to podcasts that they're subscribed to, but I recently discovered that's statistically not the case. So, if you're not already subscribed to this show, please do to be sure not to miss any episodes of this twice-weekly program. Of course, I also greatly appreciate it if you rate the show on your favorite podcasting app or give the video a thumbs up on the SuperDataScience YouTube channel. And if you have friends or colleagues that would love the show, let them know.
01:20:36
All right, thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the SuperDataScience team for producing another awesome episode for us today. For enabling that super team to create this free podcast for you we are of course, deeply grateful to our sponsors. Please consider supporting the show by checking out our sponsors' links, which you can find in the show notes.
01:21:03
And finally, thanks of course to you for listening. I'm so grateful to have you tuning in and I hope I can continue to make episodes you love for years and years to come. Well, until next time, my friend, keep on rocking it out there and I'm looking forward to enjoying another round of the SuperDataScience podcast with you very soon.