Trailblazing Berkeley professor and tech entrepreneur Dawn Song joins Jon Krohn for an in-depth discussion of her groundbreaking research on “Responsible Decentralized Intelligence.”
Thanks to our Sponsors:
Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.
About Dawn Song
Dawn Song is a Professor in the Department of Electrical Engineering and Computer Science at UC Berkeley. She is the recipient of various awards including the MacArthur Fellowship, the Guggenheim Fellowship, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review TR-35 Award, the George Tallman Ladd Research Award, the Okawa Foundation Research Award, the Li Ka Shing Foundation Women in Science Distinguished Lecture Series Award, Faculty Research Awards from IBM, Google, and other major tech companies, and Best Paper Awards from top conferences in Computer Security and Deep Learning. She obtained her Ph.D. from UC Berkeley. Prior to joining UC Berkeley as a faculty member, she was on the faculty at Carnegie Mellon University from 2002 to 2007.
Overview
As the popularity of Web3 technologies grows, so does the misuse of these groundbreaking developments. As the co-director of The Berkeley Center for Responsible, Decentralized Intelligence, Dawn Song ensures that these technologies are being leveraged and developed responsibly. Some themes that remain top-of-mind for her team include privacy preservation, regulatory compliance, fairness, ethics, inclusiveness, diversity and more.
In the future, Dawn predicts a shift into a decentralized society that gives more control to individual users and allows people to make better decisions for their interests. Moreover, decentralized systems give us more assurance that the system will work despite outside threats. Together, these benefits ensure the future viability of such technologies.
When it comes to data scientists and how they can prepare for this decentralized future, Dawn introduced us to her team’s Decentralized Data Science Platform, part of their research at the RDI Center. She aims to help data scientists and practitioners develop data science applications with an emphasis on security.
As the founder of Oasis Labs, her team collaborates with Meta AI on the Responsible Data Economy project. This initiative brings together homomorphic encryption, differential privacy, and multi-party computation to ensure efficient solutions and privacy protection. Oasis Labs is also in the process of commercializing PrivateSQL, a first-of-its-kind end-to-end differentially private relational database system. Dawn explains that PrivateSQL acts as a layer between the data analyst and the backend database.
Dawn’s reflections on her work are profound and technical as she dives even further into topics like homomorphic encryption and multi-party computation when explaining her research into responsible AI. Tune in to this episode for more from Dawn.
In this episode you will learn:
- What is decentralized intelligence? [3:46]
- Dawn’s Responsible Data Economy collaboration with Meta AI [11:31]
- How homomorphic encryption, differential privacy, and multi-party computation can work together [16:22]
- How PrivateSQL makes differential privacy easy to use [22:54]
- The relationship between deep learning and federated learning [37:55]
- What is a responsible data economy [42:13]
Items mentioned in this podcast:
- Iterative
- Oasis Labs
- Differentially Private Fractional Frequency Moments Estimation with Polylogarithmic Space
- Model-Contrastive Federated Learning
- CoLearn: Open-Source Decentralized Data Science Platform
- UC Berkeley Center for Responsible Decentralized Intelligence
- PrivateSQL
- Responsible Data Economy collaboration with Meta A.I.
- SuperDataScience Podcast Survey
- Jon Krohn’s Podcast Page
Podcast Transcript
Jon Krohn: 00:00
This is episode number 633 with Dawn Song, professor at Berkeley and co-founder of Oasis Labs. This episode is brought to you by Iterative, your Mission Control Center for Machine Learning.
00:16
Welcome to the Super Data Science Podcast, the most listened-to podcast in the Data Science industry. Each week we bring you inspiring people and ideas to help you build a successful career in Data Science. I’m your host, Jon Krohn. Thanks for joining me today. And now let’s make the complex, simple.
00:47
Welcome back to the Super Data Science Podcast. We’ve got a landmark episode for you today, which was filmed live on the keynote stage of the Open Data Science Conference West in San Francisco, more commonly called ODSC West and one of the biggest Data Science conferences in the world. Our exceptional guest for this landmark episode is Professor Dawn Song, a trailblazing Berkeley professor and tech entrepreneur who wins award after award for her work on responsible decentralized intelligence, including a MacArthur Fellowship, which is commonly referred to as the Genius Grant.
01:23
Today’s episode is a deep technical one that will appeal primarily to practitioners like data scientists, but it does have takeaway points that will allow any interested learner to keep abreast of the massive emerging potential of decentralized intelligence. In this episode, Professor Song details what decentralized intelligence is and how it relates to Machine Learning, particularly deep learning, and to other emerging technologies like the blockchain, differential privacy, federated learning, and homomorphic encryption. She talks about what a responsible data economy would look like, with specific real-world examples from applications of her research to industry, and she provides us with specific resources that she has developed to allow data scientists and software developers to easily develop and deploy privacy-preserving Machine Learning applications. All right, you ready for this deeply immersive, live-filmed episode? Let’s go.
02:21
All right. Welcome to Super Data Science Live on stage at the Open Data Science Conference West, ODSC West in San Francisco. Let’s get a crowd cheer.
02:39
Nice. I’m your host Jon Krohn and I’m joined today by a very distinguished guest. We have Professor Dawn Song. She leads trailblazing research at the intersection of deep learning and decentralized systems like the blockchain. She’s been a professor in the Computer Science division of UC Berkeley for 15 years, which, as many of you know, is the number one university in the US overall and also number one for its Computer Science grad program and for blockchain. So it seems like you’re in the right place for that. Maybe you’re even responsible for that.
03:14 At Berkeley, Professor Song co-directs a new campus-wide Blockchain and Web3 center called the Center for Decentralized Intelligence, and she’s part of the illustrious Berkeley AI Research Lab, BAIR. She’s authored over 300 papers leading to over 80,000 citations. She’s won countless awards including the Genius Grant MacArthur Fellowship, and she’s separately the founder of Oasis Labs, a data privacy startup that’s raised over $45 million in capital. So let’s start, Dawn, with the new Center for Decentralized Intelligence that you co-direct. Tell us more about the center and why it’s called a Center for Decentralized Intelligence. Maybe tell us what decentralized intelligence is.
Dawn Song: 03:59
Great. First, thanks all for having me here. Let me first talk a little bit about the new Berkeley campus-wide Blockchain and Web3 center, called the Berkeley Center for Responsible, Decentralized Intelligence. So the center’s goal, or the mission statement, is to advance the science and technology of Web3, decentralization, and decentralized intelligence, to make it universally accessible, and to help promote a responsible digital economy. And the center is called Responsible Decentralized Intelligence because the center actually focuses exactly on three key aspects: responsible, decentralized, intelligence. Okay, let me go into a little bit more detail on this. So first, responsible. As we all know, the technology field is moving really fast and we are developing really exciting new technologies all the time. And people have been talking about the large foundation models, Stable Diffusion, and so on.
05:05
And also in the Web3 world, we are developing really exciting new technologies as well, including Blockchain. And people have heard a lot about NFTs, the Metaverse, and so on. Meta even changed its name from Facebook to Meta. However, we all know that with these technologies, it is just like what people say: with great power, you also have great responsibilities. The technology can be misused, and we have seen already in the real world that these new technologies are being misused, for example, people are now continuing to worry about fake videos, fake news, and so on. In the Web3 world, of course, there are a lot of scams and a lot of attacks and so on. So one key point, especially as we move forward in this technology-driven advancement, is that we need to really make sure that the technology is being used in a responsible manner.
06:05
And when I say responsible, there are many different aspects of being responsible. Including: we want it to be privacy preserving, we want it to be regulatory compliant, and it needs to be fair, have good ethics, and also we want to enable a level playing field to enable innovation and support diversity and inclusiveness. So there’s a long list of the different aspects of being responsible. And the number one goal for the center is to ensure that as we develop these new technologies, we develop new approaches and find new solutions to ensure the technologies are used in a responsible manner. So that’s why the center starts with responsible being the first word. And the second one is decentralized. So in particular the center focuses, as I mentioned, on decentralization technologies. The center aims to develop various new advancements in decentralization technologies. And this is broadly ranging, including Blockchain, Web3, decentralized Data Science, decentralized intelligence, because the goal, or the key, of decentralization is that we can build systems without relying on central trust.
07:24
We can build systems that actually can work in this decentralized manner, essentially built on decentralized trust, or sometimes people also call it trustless. And given that I have also done research in computer security for a really long time, over two decades, we know essentially, I think, that building systems that don’t rely on centralized trust is actually the most secure way of building secure systems. And hence, I do strongly believe that in the future we are going to see more and more of these decentralization technologies, decentralized systems, because they are more robust and you can build them in a more secure way.
Jon Krohn: 08:07
Right, so decentralized systems allow us to have more trust in the system and in a way that doesn’t rely on a central administrator.
Dawn Song: 08:15
That’s a really good point, yes. It allows you to have more assurance that the system will work, even when there are various attacks, or certain central parties, or certain parties, are compromised and so on. Because you rely on decentralized trust, in the end you can have more assurance of the properties you want to ensure and enforce for the system.
Jon Krohn: 08:44
Nice. And so you mentioned that you think this is going to develop, that we’re going to have more of this in the coming years. What’s that going to look like, maybe over the next five years and then over the coming decades? How do you see decentralized intelligence evolving?
Dawn Song: 08:57
Right. That’s a really good question. And also, yes, so the third key aspect is intelligence. And intelligence also is really broad ranging. Here we are at a Data Science conference; intelligence, of course, part of it comes from data, and ultimately it’s about making decisions, how we as individuals and as groups and as a society can make the best decisions. And also, in the future, we are going to see more autonomous agents with AI, Machine Learning, and so on. So the third key aspect of the center is how we can enable these autonomous agents in a decentralized manner as well. So for example, going into the future, we are going to have more, for example, self-driving car technologies, and already we have a lot of smart agents deployed at home and other places, and they are going to continue to get smarter and smarter, and they’re going to make more and more automated decisions on behalf of their users and companies and so on.
10:05
So this is all in the broad realm of intelligence, and I do strongly believe that in the future we are going into a decentralized intelligence future, where it’s not the case that all this intelligence is controlled by centralized entities, the top tech companies, and so on. Instead, I do strongly believe that, first, we need to have decentralized intelligence, we need to have these different autonomous agents or even personalized assistants, virtual assistants, and so on, that are more controlled by individual users to essentially work in the best of their interests, working on behalf of individuals. And also, ultimately, this decentralized intelligence, we hope, can help make much better decisions overall, make more fair decisions that actually can take into account different entities’ and different users’ interests, and provide more privacy-preserving solutions as well.
Jon Krohn: 11:11
Nice. So our audience here, many of them are probably experts in the intelligence part of responsible decentralized intelligence and probably have lots of ways that they are thinking of creating more automation in our world. So what can our audience be doing? What can data scientists be doing to be thinking about having decentralization in their applications?
Dawn Song: 11:38
Oh, that’s a really good question. So yes, actually as part of the research projects that we have been doing in the Berkeley RDI Center, we’re actually developing a new platform called the Decentralized Data Science Platform.
Jon Krohn: 11:53
Perfect.
Dawn Song: 11:54
So, that actually, the goal is to help data scientists and practitioners in Data Science and Machine Learning in the real world to actually make it easier for them to develop decentralized Data Science applications with strong security and privacy guarantees. So I can tell a little bit of that story.
Jon Krohn: 12:19
Yeah, dig into it.
Dawn Song: 12:19
How this came about. So this actually just started from a recent partnership that Oasis Labs did with Meta. So in this partnership Oasis Labs worked together with Meta to develop new cutting-edge technologies in the privacy-preserving AI and Machine Learning area. So essentially in this partnership, we developed cutting-edge privacy technologies to help enable AI model fairness assessment. So for example, with Instagram and other applications at Meta, users get recommendations served from AI and Machine Learning models, and of course there are huge questions about whether these Machine Learning models are actually fair, whether they have certain biases and so on. And many of you may have seen a lot of discussions about how important it is to ensure that these Machine Learning models are fair. So then it’s a really important question for Meta to figure out how to measure this fairness for their AI models. However, in order to do that, for this model fairness assessment, it essentially requires knowing certain sensitive attributes of users.
13:50
For example, if you want to measure fairness across genders, then you need to know for a certain user, the gender of the user and race and so on.
Jon Krohn: 14:03
And we don’t necessarily want those attributes to be sent to Meta.
Dawn Song: 14:06
Exactly.
Jon Krohn: 14:06
Right.
Dawn Song: 14:10
Because in that case then Meta will be knowing too much about the user. Then there’s privacy concerns.
Jon Krohn: 14:15
This episode is brought to you by Iterative, the open source company behind DVC, one of the most popular data and experiment versioning solutions out there. Iterative is on a mission to offer Machine Learning teams the best developer experience with solutions across model registry, experimentation, CI/CD training automation, and the first unstructured data catalog and query language built for Machine Learning. Find out more at www.iterative.ai. That’s www.I-T-E-R-A-T-I-V-E.ai.
Dawn Song: 14:53
So then in this case, what Meta did in collaboration with Oasis Labs and a few others is, first, Instagram users can opt into a survey run by an independent survey operator, and then, if the user opts into the survey, the user can fill in a form providing information about their gender, race, and other types of sensitive attributes. And again, as we just discussed, it’s of critical importance that this information is not shared directly with Meta. However, this information is crucial to compute this AI model fairness assessment. So then what we developed is actually a combination of cutting-edge privacy technologies, including secure multiparty computation and homomorphic encryption, and also we use some zero knowledge proofs to ensure that the actual computation process is proper, for example, that the data is within a certain valid range and so on, and also differential privacy. So essentially, what I just mentioned, these are actually the key components of modern secure computing, privacy computing technologies. They are different technologies, but with essentially similar goals of ensuring privacy protection for computation.
Jon Krohn: 16:20
Yeah, maybe we could dig into some of those a little bit more. What is homomorphic encryption?
Dawn Song: 16:27
Okay, that’s a very good question. So homomorphic encryption is a type of encryption with a special type of property. It’s actually really beautiful and very elegant. So basically the idea is, normally with a normal encryption algorithm, you take a plaintext x and you generate an encryption, a ciphertext E(x), where E here just stands for the encryption function. For normal encryption, you take a plaintext x, you get E(x), and you take a plaintext y, you get E(y). But for normal encryption, E(x) and E(y) don’t really have any particular relationship. With homomorphic encryption, what happens is that it enables a certain relationship, a certain mapping, between the plaintexts and the ciphertexts, in the sense that, for example, if I give you the ciphertext E(x) and the ciphertext E(y), and the scheme is additively homomorphic, then I can compute the encryption of x plus y, which is E(x + y), from E(x) and E(y); in this case you can say it’s the addition of E(x) and E(y) in the ciphertext domain.
Jon Krohn: 17:50
Okay. Okay.
Dawn Song: 17:50
So there’s additive homomorphic encryption, and similarly you can do multiplicative homomorphic encryption as well, if the multiplication relationship holds as well. And then if the encryption algorithm satisfies both, or enables both, then we call it fully homomorphic encryption.
Jon Krohn: 18:08
And then so, what does that enable for us? What does that encryption enable for us relative to maybe other kinds of encryption?
Dawn Song: 18:14
That’s a great question. So with this capability, what you can see then is quite amazing. So basically if I’m a user, I’m holding a plaintext x, then I can generate the encrypted version, the ciphertext E(x), and if you are the server, then basically from my encrypted data, you can do all sorts of computation.
Jon Krohn: 18:38
I see, I see.
Dawn Song: 18:38
So then essentially you can compute over private data without seeing it.
Jon Krohn: 18:42
Got it.
Dawn Song: 18:42
So you can enable computation over encrypted data.
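To make the additive property Dawn describes concrete, here is a minimal toy sketch of the textbook Paillier cryptosystem in Python. The parameters are tiny and purely illustrative, this is not a secure implementation and not the scheme used in the Meta collaboration, but it shows computation over encrypted data: multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts.

```python
# Toy textbook Paillier: additively homomorphic encryption (tiny primes, NOT secure).
# Requires Python 3.9+ for math.lcm and pow(x, -1, n).
import math
import random

p, q = 293, 433                     # toy primes; real deployments use ~1024-bit primes
n = p * q
n_sq = n * n
g = n + 1                           # standard Paillier simplification
lam = math.lcm(p - 1, q - 1)        # Carmichael's lambda(n)

def L(u):                           # the "L" function from the Paillier paper
    return (u - 1) // n

mu = pow(L(pow(g, lam, n_sq)), -1, n)   # modular inverse used for decryption

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    return (L(pow(c, lam, n_sq)) * mu) % n

x, y = 17, 25
E_x, E_y = encrypt(x), encrypt(y)
E_sum = (E_x * E_y) % n_sq          # "addition" carried out in the ciphertext domain
assert decrypt(E_sum) == x + y      # the server never saw 17 or 25 in the clear
print(decrypt(E_sum))               # 42
```

Fully homomorphic schemes extend this so that both addition and multiplication, and hence arbitrary computations, can be evaluated on ciphertexts, which is also why they are so much more expensive in practice, as Dawn notes below.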
Jon Krohn: 18:46
So to keep going with the Meta example, you can have private information that users have about sociodemographic characteristics and that can stay private to them by using homomorphic encryption, so then just the homomorphically encrypted information gets sent to say Meta and they can perform some computation on it and then give some API response back without ever having actually seen the private data.
Dawn Song: 19:14
Yes, exactly. So in fact, actually there has been research done in the space, where you can use this type of technology. And so for example, if you want to evaluate over a Machine Learning model, but without allowing the model to actually see the plain-text of the inputs, you can send the encrypted version of the inputs and then by using homomorphic encryption, you can then basically evaluate the model, compute the inference on this encrypted data and then in the end send the encrypted inference results to the end user and the end user can decrypt it and learn the actual inference result.
Jon Krohn: 19:51
Nice.
Dawn Song: 19:52
So [inaudible 00:19:54] to this. So that’s why in our work with Meta we had to combine all these different technologies, because fully homomorphic encryption today, it’s a great technology, it’s really beautiful, but it’s still very expensive, especially when you want to compute over a really large amount of data.
Jon Krohn: 20:13
I see.
Dawn Song: 20:14
And for example, the example that I just gave, if you want to evaluate over encrypted data for this Machine Learning model, even just a small model, it can actually take a very long time to do this. So then for the model fairness assessment work, our goal is to actually be able to do this computation, and so we need a more practical solution, a more efficient solution. So that’s why in this case we also combine secure multiparty computation. So with secure multiparty computation, in this case you have-
Jon Krohn: 20:51
Secure-
Dawn Song: 20:53
Multiparty computation.
Jon Krohn: 20:54
Multiparty computation.
Dawn Song: 20:55
Right. So with fully homomorphic encryption, you are the server and it’s only one entity who can do this computation over encrypted data. But with secure multiparty computation, you do secret sharing across multiple parties so that each party actually doesn’t know the actual secrets, but altogether, in collaboration, they can actually compute the results. And it’s also a form of computation over encrypted data. But in this case, the trust model is different. So in the fully homomorphic encryption case, the trust model is that we don’t need to actually trust anyone, because as a server you don’t actually see any data, you only see ciphertext.
Jon Krohn: 21:44
Right.
Dawn Song: 21:45
But in the secure multiparty computation setting, we need to make the assumption, the trust assumption, that the attacker can only compromise at most a certain threshold.
Jon Krohn: 21:58
I see.
Dawn Song: 21:58
Of the parties involved.
Jon Krohn: 22:00
I see, I see. So we can assume that an attacker could only compromise certain parties in this multiparty system. And so even if an attacker got access, they wouldn’t be able to combine the information together.
Dawn Song: 22:13
If they only have access to at most a certain threshold.
Jon Krohn: 22:18
Right.
Dawn Song: 22:18
Of the number of servers. But the advantage of this approach is that the computation is cheaper, in certain cases, in certain settings, than fully homomorphic encryption. So that’s why in our work with Meta, we actually combine these different technologies to ensure that we have the most efficient solution and at the same time have better privacy protection as well.
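The secret-sharing idea behind secure multiparty computation can also be sketched in a few lines. This is a hypothetical toy, not the actual Meta/Oasis Labs deployment: each user splits a sensitive value into random additive shares across several servers, no single server learns anything on its own, yet together the servers can compute an aggregate.

```python
# Toy additive secret sharing over a prime field (illustrative only).
import random

PRIME = 2**61 - 1        # all arithmetic is modulo a large prime
NUM_SERVERS = 3

def share(secret):
    """Split a secret into NUM_SERVERS random shares that sum to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(NUM_SERVERS - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Three users each secret-share a sensitive attribute (say, 1 = member of group A).
user_values = [1, 0, 1]
per_server = [[] for _ in range(NUM_SERVERS)]
for value in user_values:
    for server_id, s in enumerate(share(value)):
        per_server[server_id].append(s)

# Each server sums only the shares it holds; it never sees any raw value.
server_partials = [sum(shares) % PRIME for shares in per_server]

# Only by combining every server's partial sum is the aggregate revealed.
print(reconstruct(server_partials))   # 2: the group count, with no raw data exposed
```

The threshold trust assumption Dawn mentions shows up here directly: any strict subset of the servers sees only uniformly random shares, so privacy holds as long as not too many of the servers collude.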
Jon Krohn: 22:45
Nice. Another term that you mentioned earlier when talking about this project with Meta was the idea of differential privacy. So what is differential privacy, and how is the PrivateSQL that you’ve developed going to enable our audience here to make differentially private SQL queries more easily?
Dawn Song: 23:07
Okay, great. That’s a great question. So when we talk about privacy computing, actually there are several different aspects of privacy that we need to pay attention to. So first, I talked about fully homomorphic encryption and also secure multiparty computation, and there’s another type of approach using secure hardware, secure enclaves. For all these technologies, the goal is to protect the computation process from leaking sensitive information. So basically the goal is to compute over encrypted data, but using different types of technologies. But another important aspect is that the computation is based on the original sensitive input, so we also need to ensure that the computation outputs don’t leak sensitive information about the original input. So for example, some of our earlier work in this space relates to these large language models. We showed that with these large language models, and it’s not just unique to large language models, I’m just using that as an example.
Jon Krohn: 24:18
Sure.
Dawn Song: 24:18
So in general, these models, deep learning models, they have huge capacity, so there’s a natural question whether they actually remember training data, and if they do, whether attackers can actually extract the sensitive information from the original training data by just querying this model, without even knowing the architecture or the parameters, these details of the model.
Jon Krohn: 24:43
So GPT-3 could theoretically, if trained on private data, it could be memorizing those data and then you could potentially write queries that would allow you to extract private information from a large language model like that?
Dawn Song: 24:56
Yes, absolutely. That’s actually exactly what we did. So we did a number of experiments. So one interesting experiment was with a smaller language model, where we trained the language model over a dataset called the ENRON dataset, which naturally contains-
Jon Krohn: 25:16
The ENRON dataset.
Dawn Song: 25:19
Yes. The ENRON email dataset.
Jon Krohn: 25:20
Okay.
Dawn Song: 25:22
We don’t have time to go into that, but it relates to the ENRON-
Jon Krohn: 25:24
The company, yeah.
Dawn Song: 25:26
The earlier, yeah. Yes. That was a case. So, that’s how that data was made available. And the ENRON email datasets naturally contained users’ social security numbers and credit card numbers. So our work shows that by training, when you train a language model over this dataset an attacker by devising some new attacks, it can, from just screening the language model, without knowing the details of the language model can automatically extract the social security numbers and credit card numbers that were in the original training dataset.
Jon Krohn: 26:04
Wow.
Dawn Song: 26:04
So this is an example illustrating that as we train these large models, it’s really important to pay attention to users’ data privacy as well. And then later on we extended the work to larger language models as well, including GPT-2, and we are actually looking at some new models, including models in the category of Stable Diffusion and so on. So all this illustrates that as we compute over users’ sensitive data from the [inaudible 00:26:40], whether in this case it is inference from a Machine Learning model or it can be other analytics and so on. So for example, if you are computing analytics over users’ data, then it’s important that the computation output doesn’t leak users’ sensitive inputs. So then what’s the solution to this? So today actually the gold standard is to make your algorithms differentially private.
27:10
So differential privacy is a formal notion of privacy. At a high level, what it says is, the reason it’s called differential privacy is, so for example, I have a dataset and I have a neighboring dataset with your data added to it, [inaudible 00:27:29] added to it. And now I want to run an algorithm, train a Machine Learning model, or compute a data analysis query over these datasets, actually over these two neighboring datasets. And my algorithm is randomized. So the results of this randomized algorithm over a dataset produce a probability distribution. So we say that my algorithm is differentially private if the probability distribution generated from the algorithm over these two neighboring datasets, with or without your data point, is indistinguishable.
Jon Krohn: 28:04
I see.
Dawn Song: 28:04
To the attacker.
Jon Krohn: 28:06
Right. So if there’s a difference between the datasets, that’s where differential comes from.
Dawn Song: 28:11
Exactly. Exactly.
Jon Krohn: 28:12
But we can’t tell a difference in the probability distributions computed from them.
Dawn Song: 28:13
Right.
Jon Krohn: 28:13
Got it.
Dawn Song: 28:14
So if that is the case, that means your data is pretty safe, in the sense that as an attacker, from just looking at the computation outputs, the attacker cannot tell whether your data has been used in the computation or not. And so, intuitively speaking, it provides protection for your data.
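Formally, a randomized algorithm M is ε-differentially private if for any two neighboring datasets D and D′ (differing in one person’s record) and any set of outputs S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]. A minimal sketch of the classic Laplace mechanism for a count query, with an illustrative epsilon and toy data, looks like this:

```python
# Laplace mechanism for a differentially private count (toy sketch).
import random

def dp_count(records, predicate, epsilon=1.0):
    """Noisy count whose output distribution changes very little if any one
    person's record is added or removed (the sensitivity of a count is 1)."""
    true_count = sum(1 for r in records if predicate(r))
    scale = 1.0 / epsilon   # sensitivity / epsilon
    # The difference of two exponential draws is Laplace(0, scale).
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

rides = [{"city": "SF"}, {"city": "SF"}, {"city": "NYC"}]
print(dp_count(rides, lambda r: r["city"] == "SF"))   # roughly 2, plus calibrated noise
```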
Jon Krohn: 28:35
I see. What do you think about the Super Data Science Podcast? Every episode I strive to create the best possible experience for you, and I’d love to hear how I’m doing at that. For a limited time we have a survey up at www.superdatascience.com/survey, where you can provide me with your invaluable feedback on what you enjoy most about the show and critically about what we could be doing differently, what I could be improving for you. The survey will only take a few minutes of your time, but it could have a big impact on how I shape the show for you for years to come. So now’s your chance. The survey’s available at www.superdatascience.com/survey, that’s www.superdatascience.com/survey.
Dawn Song: 29:17
So that’s differential privacy, and in our work with Meta, we also used differential privacy to protect the computation outputs from leaking sensitive information about users’ inputs. And a couple things I can add.
Jon Krohn: 29:32
Well, yeah, so I’d love to hear about the PrivateSQL that’s related to this.
Dawn Song: 29:38
Exactly. So actually from this experience, a couple of lessons that we have learned, which actually led to some of the following work that I can share: overall, all these technologies are great and there have been lots and lots of papers, hundreds of papers or even more, written on the topic, but most of the papers, most of the great ideas, have mainly been just sitting on the bookshelf. In the real world, in practice, we have seen very, very little real-world deployment of these great ideas, these great technologies.
Jon Krohn: 30:17
Okay.
Dawn Song: 30:18
So for example, with differential privacy, I think right now Google has done some special use cases and Apple has done some special use cases using differential privacy, and these are the leaders in the space, in industry, and even in their case, it’s still very specialized use cases. So overall, I think more and more people are hearing and learning about differential privacy, for example, but most people have never really seen how it’s deployed in the real world or actually know how to use it. So this started as research work in my research group at Berkeley, and now Oasis Labs has actually been productizing this, making it a commercial product, to really solve this problem: to make this differential privacy technology, in particular in this case differentially private SQL, really easy to deploy, so that more data scientists, more practitioners can benefit from this technology. So this was also motivated earlier by our collaboration with Uber, actually.
Jon Krohn: 31:36
With Uber [inaudible 00:31:38] company.
Dawn Song: 31:39
Yes, a few years back. So basically Uber, for example, and also lots of companies, have the same issue: they have a lot of sensitive users’ data. And this data of course can be very useful to provide business insights to help them improve their service for users and also figure out how to best deploy their resources to improve their overall business. And so they want to have a lot of their marketing departments or business departments and so on, just in general, for their business operations, they want different analysts or even just people in these departments to be able to somehow use the data to gain business insights. But of course this data is really sensitive; for example, Uber has every user’s rides data from [inaudible 00:32:39] and location and time and so on. It’s really sensitive information. So then they can’t give access to these marketing departments and these other departments and so on. And in fact, actually Uber in the past had fired employees.
Jon Krohn: 32:56
Uber is not famous for being very good about privacy.
Dawn Song: 33:00
So in past they had fired employees who actually misused their access rights to look up [inaudible 00:33:09].
Jon Krohn: 33:08
Yeah, God mode.
Dawn Song: 33:12
Right. Okay. So then essentially they’re in a dilemma. Either they just lock down access, and then they can’t use the data; employees cannot use the data to improve the business and also improve user experience overall. Or if they give access, then they have this huge risk of privacy and data breaches and so on. So in this case, PrivateSQL is actually a great solution for them. Because with PrivateSQL, okay, so also let me tell you a little bit in terms of the technology for PrivateSQL.
Jon Krohn: 33:46
So PrivateSQL, anyone here can use, right?
Dawn Song: 33:50
So, that’s a technology that Oasis Labs is now commercializing.
Jon Krohn: 33:55
Oh, you’re commercializing it.
Dawn Song: 33:55
Right. So what it is, again, for differential privacy, as I mentioned, it’s a formal notion of privacy, and for each algorithm, you have to develop a differentially private version of the algorithm. So for example, if you want to compute counts, if you want to compute sums or averages and so on, you need to develop a differentially private version of these analytics. And depending on the actual dataset attributes, there are different differential privacy mechanisms that you can actually use that are best for your privacy and utility trade-off. So in any case, usually it requires a lot of expertise in differential privacy to actually use it. But of course most data analysts, they don’t really know about differential privacy.
34:53
And also, you don’t want to change the backend database to embed differential privacy mechanisms in the database, because then it’s difficult to actually deploy; you have to change the backend infrastructure. So what we have done with this PrivateSQL is basically, you can view it as a layer in between the data analysts and the backend database. So basically it sits right in front of the backend database. So when users issue these queries, the SQL data analytics queries, they get automatically rewritten by our PrivateSQL layer, and it will automatically rewrite the SQL query into a new SQL query with differential privacy and other privacy mechanisms embedded into it. And then this new SQL query, we call it essentially an intrinsically private query, then gets executed on the backend database, and then the database will return some results, and our layer then will also do some post-processing in certain cases, and then in the end return the result.
35:59
So the final result in this case will be guaranteed to be differentially private, and it can also guarantee other privacy properties that your privacy policy dictates as well. So in this way, as you can see, we make this deployment problem really, really easy. In this case, we don’t change the workflow at all. The data analyst doesn’t need to know anything about differential privacy, and we don’t change the backend database either.
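As a purely hypothetical illustration of the "layer in front of the database" workflow, the sketch below intercepts an analyst’s aggregate SQL, runs it on an unchanged SQLite backend, and adds Laplace noise to the result before returning it. PrivateSQL itself does far more (query rewriting, per-query sensitivity analysis, policy enforcement); the class and table names here are invented for the example.

```python
# Hypothetical proxy sketch: analyst-facing layer that returns noisy aggregates.
import random
import sqlite3

def laplace(scale):
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

class PrivateProxy:
    """Sits between the analyst and the backend; neither side has to change."""
    def __init__(self, db_path, epsilon=1.0):
        self.conn = sqlite3.connect(db_path)
        self.epsilon = epsilon

    def query(self, sql):
        if "count(" not in sql.lower():                  # only aggregate queries allowed
            raise ValueError("only aggregate queries are permitted")
        true_value = self.conn.execute(sql).fetchone()[0]
        return true_value + laplace(1.0 / self.epsilon)  # COUNT has sensitivity 1

# Set up a toy backend database; the analyst's workflow is just: write SQL.
setup = sqlite3.connect("rides.db")
setup.execute("CREATE TABLE IF NOT EXISTS rides (city TEXT)")
setup.executemany("INSERT INTO rides VALUES (?)", [("SF",), ("SF",), ("NYC",)])
setup.commit()
setup.close()

proxy = PrivateProxy("rides.db")
print(proxy.query("SELECT COUNT(*) FROM rides WHERE city = 'SF'"))   # noisy count near 2
```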
Jon Krohn: 36:27
Perfect.
Dawn Song: 36:27
All you do is just put this layer in front of the database, so it’s really easy to deploy. And now, in Uber’s use case, they can allow their analysts to use the data without worrying about privacy violations and so on.
Jon Krohn: 36:43
Great.
Dawn Song: 36:44
So that’s just one example of a real-world use case, and we have pilots now also with BMW and also in healthcare with hospitals and so on. So we do hope to actually bring this product to Google GCP and Amazon AWS.
Jon Krohn: 37:02
Nice.
Dawn Song: 37:03
Very soon. So we hope that more companies and more data analysts, data scientists actually can benefit from it.
Jon Krohn: 37:10
It sounds super useful. And those were crystal clear use cases and I bet lots of our listeners and our audience members here can’t wait to be able to embed that technology in their organization so that they can have potentially sensitive information still be used for business functions in a way that’s safe and secure. That sounds great. So I want to turn questions over to the audience in a moment, but just before we get to that, I want to bring up a topic that I think might particularly interest this Data Science audience. So you’ve authored many papers on deep learning and you’ve talked about it at length in talks that you’ve given and in interviews that you’ve given. So what is the relationship between deep learning and the topics that you’ve already talked about today? So responsible, decentralized intelligence, and in particular, I think there’s a term, federated learning, that it’d be great to introduce us to.
Dawn Song: 38:10
Okay, great. Yeah, so federated learning is actually another component technology in privacy computing, in particular for the Machine Learning setting. And actually in a second, I’ll talk about this responsible data economy as well. So the goal of federated learning is, again, you want to train Machine Learning models, but of course a lot of the input is sensitive. So again, you want to have a good way to protect users’ data but still be able to train and use Machine Learning models. So the idea of federated learning is unlike normal Machine Learning training, for example, where users’ data is collected and all sent to the central server to then train the Machine Learning models. Of course in that case, as you can see, the central server essentially sees everyone’s data and you really have to trust the central server not to leak your sensitive data.
Jon Krohn: 39:09
Right, so it’s similar. So where PrivateSQL allowed a data analyst to run queries and get summary information in a way that keeps database information private, federated learning analogously allows a Machine Learning algorithm to train without needing to bring in private information-
Dawn Song: 39:28
Exactly. That’s why it’s called federated learning. Of course, it can also be extended; we also call it decentralized Machine Learning in certain cases. So in federated learning, the users’ data stays on each user’s device; it’s not sent to the central server. Instead, these distributed devices essentially work together, in coordination with the central server in this case, where only, you can [inaudible 00:39:59], the model updates, for example the gradient updates, are sent to the central server, which aggregates these gradient updates to train the Machine Learning model in iterations. And it’s called federated learning because now you have this federation of different devices, where users’ data only stays on the users’ devices and is not sent directly to the central server. And further extending this, decentralized Machine Learning, in this case, you don’t even have the central server; you can do this purely in a peer-to-peer setting, actually using some other decentralized network to do this.
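A toy sketch of federated averaging (FedAvg) captures the mechanic Dawn describes: each client computes an update to a shared model on its own local data, and only the updated weights, never the raw data, go back to the aggregator. The data, learning rate, and round count are illustrative; real federated learning adds secure aggregation, client sampling, and much more.

```python
# Toy FedAvg on a one-feature linear model, y ~ w*x + b.

def local_update(weights, local_data, lr=0.1):
    """One gradient-descent step computed entirely on the client's own data."""
    w, b = weights
    grad_w = grad_b = 0.0
    for x, y in local_data:
        err = (w * x + b) - y
        grad_w += err * x
        grad_b += err
    n = len(local_data)
    return (w - lr * grad_w / n, b - lr * grad_b / n)

# Each client's data never leaves the client; the server only sees weights.
clients = [
    [(1.0, 2.1), (2.0, 3.9)],   # client A's private data
    [(3.0, 6.2), (4.0, 7.8)],   # client B's private data
]

global_weights = (0.0, 0.0)
for _ in range(300):            # federated rounds
    updates = [local_update(global_weights, data) for data in clients]
    # FedAvg: the server simply averages the clients' locally updated weights.
    global_weights = tuple(sum(u[i] for u in updates) / len(updates) for i in range(2))

print(global_weights)           # converges to a line close to y = 2x
```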
40:43
So, speaking of this, the other thing I wanted to share also is that, for federated learning, together with researchers from Google and a number of other researchers, we have essentially a survey paper on federated learning, I should call it an overview paper on federated learning. That’s actually, I think, the most cited paper in federated learning today.
Jon Krohn: 41:12
Oh, really? The most cited paper in federated learning?
Dawn Song: 41:16
In federated learning, I think, right. It has many thousands of citations. And that’s a good reference for anyone who is interested in learning about it. It really talks about the different types of federated learning, the different applications, and open challenges and so on. But with that, maybe I can briefly talk about, so I’ve talked about these different technologies, homomorphic encryption, secure multiparty computation, differential privacy, and federated learning, and all these, I call them component technologies for responsible data use. So the goal is really, as we all know, data is a key driver of the modern economy and it’s really the lifeblood of AI and Machine Learning; without data, you are not going to be able to learn anything. But of course a lot of this data is really sensitive, and going forward, the problem is only going to get worse and worse. So it’s really important that as we do data analysis and Machine Learning, we enable this responsible data use, including providing better privacy and also ensuring that users actually get fair value, get benefits from their data, and so on.
Jon Krohn: 42:32
Nice.
Dawn Song: 42:32
So overall, this is what I call leading to a responsible data economy. And in my mind, there are three key principles of a responsible data economy. One, as I mentioned already, is providing better privacy and also, more importantly, providing these data rights, so that users can actually take better control of their data; this can serve as a foundation for ensuring that data is not being misused. And secondly, ensuring that users get fair value, get sufficient benefits from their data. And the third one is how we can come together as a society to ensure that we do get max use.
Jon Krohn: 43:26
Max value, yeah.
Dawn Song: 43:26
Max value out of the data.
Jon Krohn: 43:27
Nice.
Dawn Song: 43:27
For the best interests and social welfare for the whole society.
Jon Krohn: 43:36
Yeah. So no question, we have a data economy today and in the future, data will drive more and more of the economy. It’s great to be thinking about these responsible aspects of it. So what can our listeners or our audience members here do in order to be developing a responsible data economy going forward? I’ve heard about your Colink initiative, so it sounds like that might particularly be of interest to the data scientists here as a decentralized platform that makes it easy to develop and deploy privacy preserving Data Science and Machine Learning technologies and applications. Could you tell us a bit more about that?
Dawn Song: 44:12
Great, thanks. Yeah, that’s a great question. So yeah, this also actually stems from our experience working with Meta. So as I mentioned, we developed these privacy-preserving technologies for model fairness assessment. And the project has been very successful as it rolled out to the real world.
Jon Krohn: 44:32
It’s public now, right?
Dawn Song: 44:33
Yes, it’s public now.
Jon Krohn: 44:33
Colink. C-O-L-I-N-K.
Dawn Song: 44:35
Oh, okay. So first, right. So yes, but before that, the project with Meta that I mentioned earlier also rolled out in the real world, and it’s really a first-of-its-kind, large-scale deployment of privacy technologies for AI model fairness assessment. However, from that project, what we learned is that the time it took us to develop the algorithm was actually a relatively small fraction of the overall effort, because we needed to deploy the system in the real world, and this real-world deployment, building the actual distributed, decentralized system, we actually have multiple nodes running to do the secure multiparty computation with different entities; in this case, several universities participated in this as well. The whole system development and deployment actually took the bulk of the time for the project. So then this got us thinking, similar to the PrivateSQL project, we wanted to make it really easy for analysts and developers to actually use differential privacy.
45:43
But then in this case, we want to make it really easy for developers to actually use and deploy these types of new, what I call responsible data use technologies. So that’s what we developed, called Colearn, and the technology is called Colink. So the idea is we’re building this decentralized Data Science platform where, for example, the work we did with Meta took many, many months; after we had already designed the algorithm and had the initial prototype implementation of the algorithm, the rest just took many, many months to actually develop the production system. Whereas if we had had the Colearn system back then, it could have shrunk those many, many months of engineering work into maybe just one or two weeks.
Jon Krohn: 46:36
Very cool.
Dawn Song: 46:37
So really, for the platform, the goal is actually to integrate all these technologies that we developed, secure multiparty computation and federated learning, and we are integrating differential privacy and the PrivateSQL technologies in there as well, so that it’s super easy for developers to use any of the technologies and also deploy applications in the real world.
Jon Krohn: 47:02
Nice. So that sounds like a great take home for people here. If they want to be incorporating the kinds of decentralized elements that you’ve described today to allow for a responsible data economy, then Colink is the way to do that and parts of it are available now.
Dawn Song: 47:17
And you can go to the website colearn.studio to learn more.
Jon Krohn: 47:23
Colearn.studio. Nice. So sitting in the front row here at ODSC West is Serg Masís who is a researcher for the Super Data Science podcast and so he did a lot of the research behind many of the great questions that I asked today. So thank you very much to Serg, but then I’d like to open it up. We have time for a couple of questions from the audience. So the context here is that there might be lots of industries out there that would be interested in these technologies beyond the examples that you provide today, Dawn. So asset management could be an industry that would really benefit from the decentralized intelligence technologies that you described today. And then there’s a two-part question. So the first one is, do you think that companies adopting these technologies are doing it from an internal or an external motivation?
48:09
And then the second thing is, what are the practical implications of using a querying tool like PrivateSQL? If you just use the * in your SQL query, in normal circumstances, you could be pulling out credit card numbers or social security numbers. So, practically, how is the syntax different?
Dawn Song: 48:31
Okay, yes. Yeah, these are great questions. So first I think, for these companies, there are actually a number of reasons why they are now becoming more and more motivated to use privacy technologies, both from external and internal concerns. Absolutely. And also there are more regulatory requirements today as well, for example, GDPR, CCPA, and so on. So all these give companies more motivation now to deploy privacy technologies, which I’m really glad to see. I think the conversation today is very different from just five years ago, or even just two years ago. And I think that’s also why I really enjoy the conversation here: I think we do want to raise more awareness also among users and society overall, to demand, to have more demands for data privacy, and to give the feedback to companies that this is important and that it’s important for them to deploy and adopt more privacy technologies.
Jon Krohn: 49:41
Nice. And then the practical question.
Dawn Song: 49:43
Right, so that’s number one, and number two, it really depends on the queries. So in general, basically the PrivateSQL results will not return, for example, a specific individual’s records with their sensitive data and so on. So usually PrivateSQL is deployed for, you can call them, statistical queries, aggregate queries. So in the results that it returns, basically it will return you differentially private results. And also, as I mentioned, this query layer is actually very flexible. What we have done in the past is, of course you can embed differential privacy mechanisms, but you can also embed other types of requirements. For example, in our collaboration with Uber, previously they have put in certain policy requirements, for example, certain columns that you simply just cannot even query. You cannot do certain operations on those.
Jon Krohn: 50:48
Nice. So Dawn, thank you so much for taking the time to be here and let’s give Professor Song a hand. Thank you.
Dawn Song: 51:01
Thank you again, thank you so much for having me. Thank you.
Jon Krohn: 51:08
What an incredible experience to be on stage with a luminary like Professor Song in front of the engaged live audience of ODSC West, filming a Super Data Science episode for you. In today’s episode, Dawn filled us in on how responsible decentralized intelligence allows the training of compute-intensive Machine Learning models that can be distributed across many machines without compromising on data privacy or security. She talked about how her responsible data economy collaboration with Meta AI enables Facebook and Instagram’s algorithms to be more fair without users’ sensitive information being shared with Meta. She talked about how homomorphic encryption, differential privacy, and multiparty computation can work together to facilitate the vibrant responsible data economy of the future. How large language models like GPT-3 can memorize private data such as credit card numbers and social security numbers, and then reveal these private data to people who query the model.
52:00
How PrivateSQL makes differential privacy easy to use, enabling a data analyst to access anonymized, aggregated data from a database that includes private information. And she left us with her resources for data scientists to learn how to make it easy to develop and deploy privacy-preserving Machine Learning applications. To get those, again, you can go to colearn.studio. All right. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Dawn’s social media profiles, as well as my own social media profiles at www.superdatascience.com/633. That’s www.superdatascience.com/633. Not just this episode number 633, but every single episode I strive to create the best possible experience for you, and I’d love to hear how I’m doing at that. For a limited time we have a survey up at www.superdatascience.com/survey where you can provide me with your invaluable feedback on the show.
52:57
Again, our quick survey’s available at www.superdatascience.com/survey. Thanks to my colleagues at Nebula for supporting me while I create content like this super Data Science episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the Super Data Science team for producing another trailblazing episode for us today. For enabling this super team to create this free podcast for you, we are deeply grateful to our sponsors. Please consider supporting the show by checking out our sponsors links, which you can find in the show notes. And if you, yourself are interested in sponsoring an episode, you can get the details on how by making your way to jonkrohn.com/podcast. Last but not least, thanks to you for listening all the way to the end of the show. Until next time, my friend, keep on rocking it out there and I’m looking forward to enjoying another round of the Super Data Science podcast with you, very soon.