
90 minutes

Machine Learning, Data Science

SDS 737: scikit-learn’s Past, Present and Future, with scikit-learn co-founder Dr. Gaël Varoquaux

Podcast Guest: Gaël Varoquaux

Tuesday Dec 05, 2023

Subscribe on Website, Apple Podcasts, Spotify, Stitcher Radio or TuneIn


It's a special episode this week as scikit-learn co-founder Gaël Varoquaux sits down with Jon Krohn live at the historic Sorbonne in Paris, where they delve deep into the evolution of scikit-learn. From its origins as a memory-efficient Python implementation of support vector machines to its present-day status as a pivotal resource in machine learning, Gaël paints a vivid picture of its remarkable growth.


Thanks to our Sponsors:

Interested in sponsoring a Super Data Science Podcast episode? Email natalie@superdatascience.com for sponsorship information.

About Gaël Varoquaux
Gaël Varoquaux is a research director working on data science at Inria (the French national institute for research in digital science and technology), where he leads the Soda team. Varoquaux's research covers fundamentals of artificial intelligence, statistical learning, natural language processing, and causal inference, as well as applications to health, with a current focus on public health and epidemiology. He also creates technology: he co-founded scikit-learn, one of the reference machine-learning toolboxes, and helped build various central tools for data analysis in Python. Varoquaux has worked at UC Berkeley, McGill, and the University of Florence. He did a PhD in quantum physics supervised by Alain Aspect and is a graduate of École Normale Supérieure, Paris.

Overview
In this captivating episode, Gaël enlightens us on the evolution and future trajectory of scikit-learn. Initially born as a memory-efficient implementation of support vector machines in Python, scikit-learn has evolved into an all-encompassing ML compendium, resembling an executable textbook for machine learning methodologies. Gaël sheds light on the project's growth, from its origins as a SciPy fork to becoming an essential resource for ML practitioners, covering diverse approaches to machine learning.

Moreover, Gaël discusses the groundbreaking integration of CuPy into scikit-learn, enabling remarkable speed enhancements, with operations now running up to 10 times faster on GPUs. He emphasizes how this advancement opens doors for broader and faster data analysis, propelling the project into a new era of efficiency and scalability.

Gaël's advice isn't limited to developers alone; he shares valuable insights on contributing to impactful open-source projects, highlighting skrub, a promising initiative for data preparation in ML modeling, as a perfect entry point for aspiring contributors.

Beyond scikit-learn, the work of Gaël's Soda lab during critical periods like the COVID pandemic, and in domains like diabetes and medico-economics, underscores the tangible societal impact of robust data analysis.

Lastly, Gaël underscores the importance of statistical rigor in today's data-driven landscape, emphasizing that while tools can aid decision-making, domain expertise remains paramount for effective real-world problem-solving. Surprisingly, this only touches on the topics that Gaël and Jon cover in this enlightening episode. 

In this episode you will learn:
  • The early beginnings and growth of scikit-learn [05:34]
  • Development principles of scikit-learn [18:05]
  • How to apply scikit-learn to your ML problem [21:16]
  • Resource-efficiency and scikit-learn development [25:32]
  • How to contribute to an open-source project like scikit-learn yourself [38:21]
  • The future of scikit-learn [51:13]
  • Gaël on the social-impact data projects in his Soda lab [1:02:33]
  • Why domain expertise and statistical rigor are more important than ever [1:11:24]



Items mentioned in this podcast:

Follow Gaël:
Jon Krohn: 00:00:00
This is episode number 737 with Dr. Gaël Varoquaux, co-founder of scikit-learn and Research Director at Inria. Today's episode is brought to you by Gurobi, the Decision Intelligence Leader, by Data Universe, the out-of-this-world data conference, and by CloudWolf, the Cloud Skills Platform.

00:00:23
Welcome to the Super Data Science Podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I'm your host, Jon Krohn, thanks for joining me today. And now, let's make the complex simple.

00:00:54
Welcome back to the Super Data Science Podcast. For today's very special episode, I traveled to Paris to record an episode live in person with the extraordinary Gaël Varoquaux. Gaël co-founded scikit-learn, the standard software library for machine learning worldwide, which is downloaded over 1.4 million times per day. He actively leads the development of the ubiquitous scikit-learn Python library today, which has several thousand people contributing open-source code to it. He's also Research Director at the famed Inria, which is the French short form for the National Institute for Research in Digital Science and Technology. And there he leads Soda, the Social Data team, which is focused on making a major positive social impact with data science. He's been recognized with the Innovation Prize from the French Academy of Sciences, and many other awards for his invaluable work. 

00:01:43
Today's episode will likely be of primary interest to hands-on practitioners like data scientists and machine learning engineers, but anyone who'd like to understand the cutting edge of open-source machine learning should listen in. In this episode, Gaël details the genesis, the present capabilities, and the fast-moving future direction of scikit-learn. He talks about how to best apply scikit-learn to your particular machine learning problem, how ever-larger datasets and GPU-based accelerations impact the scikit-learn project, and how, whether you write code or not, you can get started on contributing to a mega impactful open-source project like scikit-learn yourself. He talks about hugely successful social impact data projects his Soda Lab has had recently, why statistical rigor is more important than ever, and how software tools could nudge us in the direction of making more statistically sound decisions. All right, you ready for this fantastic episode? Let's go. 

00:02:39
Gaël, welcome to the Super Data Science Podcast. I'm honored to be with you here at one of the Sorbonne campuses in Paris. Thank you for welcoming me here and booking a room for us. It's been a great European tour for me so far, and I mean to be sitting here with you, a co-creator of scikit-learn, it's unreal. Thank you so much.

Gaël Varoquaux: 00:03:02
Thank you. 

Jon Krohn: 00:03:03
So, yes, we know each other through Reshama Shaikh, whom I also would've mentioned a few episodes ago when I was filming with Daniela at the University of Amsterdam. So when I announced that I was going to be doing this European tour, Reshama Shaikh reached out to a number of people in the different cities that I was going to be in, and you're one of the people that said, "Yeah, let's do the interview." So thank you Reshama as well. So Gaël, let's jump right into scikit-learn, I'm sure all of our listeners have heard of scikit-learn already. As we were doing research for this episode, I was blown away to learn that scikit-learn has 1.4 million downloads a day. It's absolutely insane. 

Gaël Varoquaux: 00:03:44
A fraction of these, how big we don't know, is probably bots doing legitimate things, but doing automated things. It's not 1.4 million new users, obviously.

Jon Krohn: 00:03:55
Yeah, yeah, yeah. And it's a part, obviously, of packages like Anaconda and these popular- 

Gaël Varoquaux: 00:04:01
But that also doesn't show on our download statistics. 

Jon Krohn: 00:04:05
Oh, really? 

Gaël Varoquaux: 00:04:05
The download statistics you have are probably from either PyPI or conda-forge, and those record the automated downloads, but then you have all the Anaconda users, all the different packages, different distributions. 

Jon Krohn: 00:04:18
Oh, wow. Yeah, yeah, I didn't even think of that. So yeah, it's going to be even more than 1.4 million. Although, yeah, as you say, of course it's... When you're going to have a super compute cluster to train GPT-4.5, or whatever, all of those thousands of machines are all probably going to install some- 

Gaël Varoquaux: 00:04:35
The continuous integrations everywhere are probably downloading a lot. No, what's interesting, I mean, we don't know how many users we have, but we're looking at several millions, maybe several tens of millions of users, and that's really interesting to me because it's a technical package. I would never have thought that something whose root is applied mathematics would be useful to that many people. Data science. 

Jon Krohn: 00:05:00
Yeah, absolutely. Yeah, an essential package, and it's amazing the breadth of topics that are covered. Actually, that'd be an interesting thing just to learn when scikit-learn started, because scikit-learn predates me using Python. I was previously an R user and a MATLAB user; those were my primary tools for doing computational statistics, numeric operations and computing. And so by the time that I was using Python, scikit-learn was already established as the obvious player in machine learning. So maybe I think it would be interesting, certainly for me and I'm sure for a lot of our audience, to hear from the beginning, a rough timeline of what was the impetus behind starting it? And then when it first started being released, what did it cover and how has that grown over time? 

Gaël Varoquaux: 00:05:49
So the history of scikit-learn is sometimes not very well understood. I did some investigation recently, and was able to track it quite well. So in the SciPy package, so the SciPy package is a core numerical routine package. So in the SciPy package, around year 2005, something like this, started growing some basic machine learning code, and eventually it was moved out into a separate package alongside SciPy, one of what were called the SciKits, because it wasn't mature enough and everything. So it was growing there. Then 2008, 2009, myself, I started doing machine learning. I originally was a MATLAB user, I had transitioned to Python, and me and my team, we needed basic machine learning tools, and we were recoding them in the lab, but that felt super inefficient. So we felt like, "Okay, we need to create something that's bigger." So then what we did was that we audited the different options. To be honest, I was convinced by none of them for different reasons, sometimes licensing, sometimes the API. And so then we decided we would start a new one, but we hated starting from scratch.

00:07:09
So what we did was that we turned to SciPy and we said, "How about we spin this thing out?" And we ended up actually replacing all the code, but what's most important is by spinning this thing out, we immediately had a community of developers around the project, and from the start, this was the project: a community of developers. That's the start. 

Jon Krohn: 00:07:32
Of course, yeah. When you were, I guess, initially forking off of SciPy to the Scikit, what were the kinds of initial functions that the Scikit could do that you couldn't do in SciPy? 

Gaël Varoquaux: 00:07:51
Well, it was Fabian Pedregosa who did this. The first thing we did was that we did a good wrapper of LIBSVM. There was a wrapper of LIBSVM, but it wasn't good. It induced way too many memory copies, and so we knew Python quite well, and we knew that we could do things much better with much fewer memory copies. So this is the first thing we did. And then we were using these things, and then two sets of functionality came out. We needed linear models, we needed good linear models, and so we started working on this. The other thing is that we needed to use this in practice. We needed to be able to do model selection, cross-validation, and so this is how we eventually went towards the API, which is now famous. It's just that we needed to be able to do our evaluations, and things felt wrong. We had started with a different API, but things felt wrong. So that's the two things that happened, linear models and the API that allowed us to do model selection. And then things started happening that were from other people in other groups that we hadn't planned, and that's when it started getting really interesting. 
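
For readers who want to see the pattern Gaël is describing, here is a minimal sketch of that now-famous API: every estimator exposes the same fit/predict interface, so one model-selection call works for any of them. The dataset and the two estimators are arbitrary illustrations chosen for this write-up, not the code from that era.

```python
# A minimal sketch of the scikit-learn estimator API: the same
# cross-validation call wraps any estimator that implements fit/predict.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

for model in (SVC(), LogisticRegression(max_iter=1000)):
    scores = cross_val_score(model, X, y, cv=5)  # identical call for both models
    print(type(model).__name__, scores.mean())
```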

Jon Krohn: 00:09:03
Yeah, that is how it can happen in open-source. And I guess from the very beginning then it was largely academics that were working on this.

Gaël Varoquaux: 00:09:13
Well, yes and no. Mostly, but I had an acquaintance that I knew through the local... The Paris Python community who was in a startup, and who was really interested in machine learning, and who I knew was very talented. His name is Olivier Grisel, and almost immediately, we started talking and working together. And eventually, after a few years, Olivier actually joined us and is now an engineer. So he is, quote-unquote, in academia, but he's an engineer. He has worked in startups before, he has an engineering mindset. So from the start, it's the bridge between engineering and academia, and I think that's important.

Jon Krohn: 00:09:54
Yeah. And it is, of course, used by both groups equally these days, of course. So yeah, so started with LIBSVM, so support vector machines, which of course around 2008, 2009 would've been all the rage. I think there was a joke at that time that NeurIPS, which has neural networks right in the name, was more like SVMIPS. 

Gaël Varoquaux: 00:10:18
Yeah, kernel land. 

Jon Krohn: 00:10:19
Kernel land?

Gaël Varoquaux: 00:10:22
Yeah. 

Jon Krohn: 00:10:22
Oh, that's funny. Yeah. So then how has the project progressed since then? You mentioned other people getting involved. So at what point was there a decision to say, "Okay, let's make sure that we cover all of the key machine learning kinds of models that you might need to develop?" 

Gaël Varoquaux: 00:10:45
So fairly early on, we wanted to have something that was general purpose. Fairly early on, I thought, "What we would like to have is something that looks like a machine learning textbook, but that's actually runnable." And it wasn't really a decision, it's just that we kept adding things. I mean, my group and other people, and the diversity and the complementarity of models that go in is what provides the value. Going back to a textbook, we want people to get an understanding of machine learning as a process, as a statistical process. And so for this, we need to be able to provide different approaches to compare, because there is not one magic bullet.

Jon Krohn: 00:11:42
Yeah, yeah, yeah, yeah. And so I guess then this would've required people with different expertises. So each step of the way, as you're creating this textbook, this runnable textbook, you think, "Okay, well, we're going to add in linear regression, and then logistic regression, and then we're going to want to be able to have appropriate kinds of dataset splitting techniques." And so I guess at each step of the way, you can think, "Okay, well, we know so-and-so who's a regression expert or this person who's an expert at splitting data sets," and so you get their advice and they contribute. 

Gaël Varoquaux: 00:12:17
Yeah, so I remember, so one person that we specifically targeted was Peter Prettenhofer. I don't remember where he was based back then. For a while now, I think, he's been working for DataRobot. And so he was an expert in gradient boosting and we were like, "Oh, well, you seem to be doing cool things." And I remember, Peter was really cool about this because he was like, "Okay, cool. Let's merge. Let's merge what I've done with what you do, and we're going to build something that's bigger and greater." So then we also had people that we didn't target. We had Jean-Luc coming in, and Jean-Luc was a random forest expert, and he was like, "Random forests are so important." I knew nothing about random forests, and so he contributed random forests. So it's not like there was always a strategy of targeting people. Sometimes there was, sometimes there wasn't. And then we had different people coming in. We had people who were also doing things like text applications, and they had an understanding of problems that we didn't have. Supporting sparse matrices was important, other details were important, and just things kept aggregating and becoming... 

00:13:26
So we had a lot of cultural differences, but those cultural differences were interesting because once we were able to understand each other, we were able to do better code. Code that was better from a numerical standpoint, but also better from an API standpoint because it was more general, so that was a really important part of the process. 

Jon Krohn: 00:13:50
Yeah, yeah. All right, and then so... Hopefully this isn't too contentious of a question, Gaël, but then as Theano came along for neural networks, and then later Keras, TensorFlow, PyTorch. With probably PyTorch now being the leading choice for neural networks, what's the perception of those other kinds of machine learning projects in Python from the perspective of people working on scikit-learn? 

Gaël Varoquaux: 00:14:15
So I remember very well Theano because we're a community, and we knew them. No, I mean, it was really interesting. I didn't believe in the thing, I thought they were over-engineering it, but that's fine, that's part of the process. I don't have to believe in the thing to actually enjoy their presence. So Theano came in, and I remember actually very early on having design discussions with Yoshua Bengio, and we had a different point of view because I personally have a bias towards simplicity. I love simplicity, and these people were going for a very complex model, and so they did. And so they grew, things happened. TensorFlow arrived, and I didn't really like it because it felt like C++ done in Python. So I didn't like it from a technical perspective, and then PyTorch arrived, and I loved PyTorch, it has a very good API. And by the way, in the meantime, I had gotten more convinced that for certain applications, the complex models were a good thing, and so I find it really interesting. So we saw PyTorch grow, we had interactions with the PyTorch developers, and we are part of their community.

00:15:29
Then of course, there's the question of, "Should we try to be competing or not?" And the answer for this is clear, we shouldn't. They're doing an amazing job, we're doing different things, we live in an ecosystem. It's counterproductive to be competing. I'm actually very happy they're around. What matters is that we build a good ecosystem. 

Jon Krohn: 00:15:50
Gurobi Optimization recently joined us to discuss how you can drive decision-making, giving you the confidence to harness provably optimal decisions. Trusted by 80% of the world's leading enterprises, Gurobi's cutting edge optimization solver, lightweight APIs, and flexible deployment simplify the data to decision journey. Gurobi offers a wealth of resources for data scientists, webinars like a recent one on using Gurobi in Databricks. They provide hands-on training, notebook examples, and an extensive online course. Visit gurobi.com/sds for these resources and exclusive access to a competition illustrating optimization's value with prizes for top performers. That's G-U-R-O-B-I.com/sds. 

00:16:35
Right, exactly. And yeah, so it seems today like there's some kinds of things that I obviously go to scikit-learn for, there's some other things that I go to PyTorch for. And I think this is probably going to be obvious, I'm stating the obvious for a lot of our listeners here, but that same voyage that you described there, where the first version of TensorFlow, the 1.0, and the whole one series, it was wild how complex it felt to be able to run things. Although that also meant... Because that was when I got started with teaching deep learning to the public, and so there was a huge opportunity for TensorFlow instruction because it was so complex, and so [inaudible 00:17:15] created this weird cottage industry of people like me who needed to be creating all these tutorials, and doing all this teaching. And then PyTorch comes along and you're like, "Okay, it's Pythonic, it's easy. It runs right away, it's obvious." And yeah, I'm happy that I don't need to spend, so at that time, I was doing a five-Saturday course or six Saturdays, I can't remember, at the New York City Data Science Academy.

00:17:43
And basically, one of those days when I initially started teaching it, would go to TensorFlow 1, and just how to get things running in it. So how you can be doing operations without Keras, whereas the rest of the course we were just doing it in Keras, which seems to have been related to the inspiration of PyTorch. And so this actually ties to something that is... So it sounds like a clear development principle for scikit-learn is simplicity. 

Gaël Varoquaux: 00:18:13
Yes. 

Jon Krohn: 00:18:15
Are there other principles that you can convey straightforwardly? I don't know if that is a- 

Gaël Varoquaux: 00:18:22
Well, one thing we're trying to do these days is try to get people to do valid data science, and that's a challenging problem. But these days, we had a vision discussion at our last sprint, and a good vision statement that was put forward by Thomas Fan, he's in New York, is that our goal is to make it easier for people to do valid and useful data science. And so more and more, we're working on problems of model evaluation, problems of data preparation. So how do we fit in the bigger pipeline because we find often, in an application setting, that might be the weak point. So it's about simplicity, but it's also about validity and usefulness.

Jon Krohn: 00:19:11
Nice, very well said. And then there's also, from our research, it seems like being inspectable is also something that's important to you. 

Gaël Varoquaux: 00:19:22
Right, so the inspectable problem came from our interactions... At least on our side of the Atlantic, it came from our interactions with some of our funders. We have sponsors who give us money, and we need this because we have people working on the project, and part of these organizations come from the banking and insurance world. And in Europe, there is regulation that these things need to be auditable, and so they wanted us to help them make auditable things. And it was a really interesting process because it's back to validity. It's not just, "Hey, let's have a tool." It's about also understanding how you use this tool. And so actually, a lot of the work here was not only making the model interpretation tools, but also making documentation. And I hope it's useful, but making documentation with plain English and examples that discuss interpretation and pitfalls. Even for a linear model, it is easy to misinterpret things in a linear model. And so it's really back to what we are currently doing in scikit-learn, it's the tool, but also trying to make it easy for people to use it well. 
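
As a companion to this point about inspection, here is a hedged sketch of the kind of model-interpretation tooling scikit-learn documents, using permutation importance on held-out data (reading raw coefficients on the training set alone is one of the pitfalls the documentation warns about). The dataset and model are arbitrary stand-ins, not ones discussed in the episode.

```python
# A small sketch of model inspection with permutation importance,
# computed on a held-out test set rather than on the training data.
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
ranked = sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1])
for name, mean in ranked[:5]:          # five most influential features
    print(f"{name}: {mean:.3f}")
```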

Jon Krohn: 00:20:43
Nice. And actually, we have a whole section on that coming up later in the episode, on these considerations around doing data science properly. And so yeah, we'll come back to that idea. We might end up touching on it again in between, but we do have a whole section on that prepared. So when you're thinking about making things simple... And actually that does actually tie into this conversation that we're going to dig into in more detail later, but when somebody comes into scikit-learn, without setting any arguments themselves, they can choose to just run a linear regression, or a random forest, or whatever kind of machine learning model they'd like to run. And there are default hyperparameters, default arguments that are just set. How easy is it to do that? To have a default that's ready to go? 

Gaël Varoquaux: 00:21:41
So first, this was an explicit choice that we had. We don't want people to have to understand the intricate detail of a model to use it. That means that anything on which we can have a default, and what I mean by "we can have a default" is that the model runs without raising an error. It doesn't mean it's correct or optimal. It means that it runs without raising an error. Then we put a default, like a default number of clusters. It's stupid, it's meaningless, but still, you run. So historically, we put defaults that seemed reasonable. Honestly, we made some pretty bad choices. The thing is as we grew, we have more workforce, and so now we do a lot of empirical evaluation before we make choices, any kind of choices. But we also have an important point of view on backward compatibility. I want upgrading scikit-learn not to be something that worries people, and so it means that any change we do is a slow one. It's not easy for us to change those defaults. I would gladly change some defaults, but I worry that it might break things that people are using. 
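
A quick sketch of the point being made here: every estimator can be instantiated with no arguments at all and fitted straight away on its defaults. The synthetic dataset is just a stand-in for illustration.

```python
# Every scikit-learn estimator runs with all hyperparameters at defaults.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

clf = RandomForestClassifier()                 # no arguments: defaults only
print(cross_val_score(clf, X, y, cv=5).mean())
```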

Jon Krohn: 00:23:08
So I guess then, do you have advice for our listeners on how they can be doing this ideally? What do you recommend, when somebody's getting started with some new... Maybe a machine learning model that they aren't that familiar with, but they have read online or GPT-4 has suggested that this is the scikit-learn code that they should be using. What do you recommend for them as a good flow as they're getting started with that new approach to them? Maybe they're not so familiar with all the arguments. Should they just give it a shot, use the defaults, see how it goes? 

Gaël Varoquaux: 00:23:42
They should. I don't think we have any defaults that are massively wrong. We could have defaults that are slightly better, but none of them are massively wrong. 

Jon Krohn: 00:23:52
Nice. Okay, that's a great answer. Yeah, and that's my experience. It is nice to have that reinforced. You're looking for confirmation biases all over the place, and that gives me a great one because that's definitely the way that I like to get going.

Gaël Varoquaux: 00:24:04
We've been having, in my group but also in other groups, fairly systematic studies, and we've seen that, for instance, we can gain a bit of accuracy by changing defaults. For this accuracy, we're looking at a wide range of datasets. This is valuable because many people don't change the defaults, and that's fine, but it's there. And so if we can improve this a bit, we're making their life easier. Now the problem is we can't do this without breaking backward compatibility, so I don't know where we're going there. 

Jon Krohn: 00:24:40
Yeah, that is interesting. So backward compatibility, this obviously sounds like something that's going to be a recurring headache anytime you're doing updates, but we really appreciate it as a community using your tool, because unlike many other libraries out there, it is nice to not worry about the machine learning code that we implemented in scikit-learn crashing when we update libraries. So you touched on this a little bit earlier as well, in fact, you touched on this in your genesis story for scikit-learn, it was about efficiency. So it was the LIBSVM, the support vector machine implementation, where you wanted to ensure that it was resource efficient, that the memory copies weren't getting out of hand. So that resource efficiency is also, it seems like a core principle of scikit-learn development. So what kinds of strategies do you need to have from your side managing the development of scikit-learn to ensure that there is this resource efficiency? 

Gaël Varoquaux: 00:25:42
Well, speaking about resource efficiency, as you say, it's resources, it's not only computation. So there's a gradient of complexity. We can go towards micro-optimized code on GPU. Is that the right thing to do? Well, maybe not, for several reasons. We have limited developer resources. As we're getting more developer resources, we are starting to tackle things that we weren't tackling before, but also, which fraction of our users have a big GPU, what's the bottleneck in the pipeline? I don't know, by the way, this is a difficult problem, but if the bottleneck in the pipeline is data preparation upstream of scikit-learn, then it's not useful. What are the models that are most used, and which ones should we optimize? So from a big picture, these are the questions that we're asking. Now in terms of optimization, once again, it's a trade-off. We need to keep the code manageable. As our community of developers grows, we can go towards more complexity, but there is also a problem of managing complexity.

00:26:55
If your code base becomes too complicated, only a few people can handle it. So we're navigating this trade-off. What we've been doing historically... Well, first a lot of work on the algorithms. Having good algorithms has always made a big difference. For gradient boosting, for instance, a few years ago, we implemented a histogram-based gradient boosting, a trick that was developed first in LightGBM. And just doing this makes a massive difference. Most people don't know about this, by the way. We have the HistGradientBoostingClassifier and HistGradientBoostingRegressor, and they tend to be faster. And then it's moving down to the low-level optimizations. So being careful about memory copies, using the array manipulation objects well, and then maybe moving things to lower-level optimized code. We're currently exploring having backend mechanisms that allow hardware-specific plugins to come in, which would allow micro-optimized code for a given CPU and maybe even for a given GPU, we're looking at this.
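
For listeners who have not tried the estimators just mentioned, here is a hedged sketch comparing the classic and histogram-based gradient-boosting classifiers on a synthetic dataset; on data of this size the histogram-based one is usually much faster, though actual timings depend on your machine and scikit-learn version.

```python
# Rough timing comparison of the two gradient-boosting implementations.
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier

X, y = make_classification(n_samples=20_000, n_features=30, random_state=0)

for Estimator in (GradientBoostingClassifier, HistGradientBoostingClassifier):
    start = time.perf_counter()
    Estimator().fit(X, y)                       # defaults for both estimators
    print(Estimator.__name__, f"{time.perf_counter() - start:.1f}s")
```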

00:28:14
We're also, in the last years, we've started exploring using the array API to be able to move in CuPy arrays, and things like this. And part of the scikit-learn code today works with GPU arrays. People don't know this, and we haven't been advertising it too much because first it's only parts. And then second, the question is, "How important is this in the broader flow of data science?" But as we're getting better, as we're supporting this more and more, we're going to advertise it and see how we can connect the dots, because it's about making things easier. And if people have to convert their array from CuPy to NumPy, and when, say, you move into a PCA, you convert to CuPy, you move back, it's going to be hard for most people. And so one question in the future is, "How are we going to be able to streamline this?" I don't have an answer, but I'm pretty sure we're going to make progress there.
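
To make the array API idea concrete, here is a hedged sketch of the opt-in mechanism. Which estimators accept GPU arrays depends heavily on the scikit-learn version (LinearDiscriminantAnalysis was among the first), and the dispatch requires the array-api-compat package, so treat this as an illustration of the mechanism rather than a guarantee of coverage.

```python
# Sketch: opting in to array API dispatch so an estimator can consume
# CuPy arrays directly (falls back to NumPy if CuPy is unavailable).
import numpy as np
import sklearn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

sklearn.set_config(array_api_dispatch=True)   # needs array-api-compat installed

try:
    import cupy as xp      # GPU arrays, if CuPy and a GPU are available
except ImportError:
    xp = np                # otherwise stay on the CPU with NumPy

X = xp.asarray(np.random.rand(1_000, 20))
y = xp.asarray(np.random.randint(0, 2, size=1_000))

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)              # computation stays on the device of the input arrays
print(type(lda.transform(X)))
```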

Jon Krohn: 00:29:23
Yeah, it does seem to be like... I didn't know about these CuPy integrations that you're working on for scikit-learn, but it makes sense to me. There was actually, at the time of us recording today, yesterday, NVIDIA announced Pandas integrations directly. And literally, I just saw this on my social media feed for a few seconds yesterday, so I can't articulate it very well in this episode. But the upshot is that now for some kinds of operations, by leveraging NVIDIA GPUs, Pandas operations can be accelerated by 10 times, up to maybe even 1,000 times for some kinds of operations.


Gaël Varoquaux: 00:30:48
So to draw a parallel, there are two things here. We can either take CuPy arrays, which we can already do, and you're probably looking at these 10x speedups in some places. Not everywhere, but in some places. We haven't merged in the PCA backend for CuPy arrays, but that's typically a place where you can get 10x speedups. And then there is also swapping in an internal computation, and that's typically for low-level operations. And this is something we've been prototyping, and it's still at the prototype level, but the feeling is that there are quite a few places where you can get also 10x speedups. And now the question is, if we do this, we're going to have increased complexity on our side, and how do we make sure that we have the resources to tackle this? 

Jon Krohn: 00:31:49
The human resources. 

Gaël Varoquaux: 00:31:51
The human resources. 

Jon Krohn: 00:31:51
Yeah. 

Gaël Varoquaux: 00:31:51
We've had a very good working relationship with NVIDIA. 

Jon Krohn: 00:31:53
Oh, nice. Yeah, they're the people to partner with on GPU operations. It's wild with CUDA, the stranglehold that they have, it's amazing even in a time like today, where there's this race to be able to stockpile GPUs, to be able to train large language models, say in particular, and to be able to run inference even more importantly on large language models, tons of startups like my own, Nebula, we need access to NVIDIA GPUs in the cloud in order for our application to run. And there's thousands, tens of thousands, maybe hundreds of thousands of these kinds of generative AI startups like ours that all want to be able to do this, on top of already the big tech players who need these running for their Google Bard and for their ChatGPT. It's wild to me, the stranglehold that NVIDIA has on the world at this time, where there's such a wild demand for GPUs. Where the big tech companies obviously need tons of GPUs, not just for training their models.

00:32:56
Like GPT-5 coming up for OpenAI, Google Bard not only need to be trained, but the 99% or more of the compute cost is on inference. And so there's this huge demand from all these big tech companies with their big GPU services, there's tons of small startups, like my own, that need access to these for our generative AI capabilities in our platform. And despite all of this need for GPUs because of this amazing software, CUDA, that NVIDIA has, they maintain almost a monopoly on GPU usage. There's alternatives out there. AWS has their Trainium and Inferentia alternatives, Intel has alternatives, I think AMD has alternatives, but we don't really see those being taken up, even though there is this far more demand than capacity amongst NVIDIA GPUs. 

Gaël Varoquaux: 00:33:59
Right, so I think you're saying an important thing. It's partly, and importantly, the software layer. And it's not only about CUDA, it's about the drivers, it's about all the little details of the stack. And NVIDIA has been really good at making sure that the stacks work really well with NVIDIA. And so indeed, it ends up creating a bottleneck. And here, I always go for simplicity. What matters to me is impact. Impact, say, in healthcare. And in many places, the evidence is we don't always need very complex models, they actually may be counterproductive because they're harder to audit. And so there's two things about this. First, there is some form of fashion of going for the complex thing that sometimes doesn't pan out. And then second is, so NVIDIA has been providing excellent GPUs and an excellent software stack. My question, and I don't have any answer to this, is how much could we use simpler GPUs? 

00:35:02
And one of the bottlenecks today is the software stack. And from a big picture, it's disappointing because even the small data scientist working on the laptop has access to a GPU. Could that be used to speed up data science? I don't know. 

Jon Krohn: 00:35:20
Yeah, just like their Apple GPU. 

Gaël Varoquaux: 00:35:21
Exactly. 

Jon Krohn: 00:35:22
Yeah, yeah, yeah. It is interesting, there's something really obvious there.

Gaël Varoquaux: 00:35:29
It's a software stack problem. And back to the fact that we have an excellent working relationship with NVIDIA, NVIDIA goes out, NVIDIA doesn't only worry about their chips, or their drivers, or CUDA. They go out, they talk to people. They're currently funding someone who works on scikit-learn, but really works on scikit-learn, not on GPU and scikit-learn, just making sure that scikit-learn is healthy. And that's the kind of relationship we have with NVIDIA, and I think it's very impactful because they try to understand the whole ecosystem, and to make sure the whole ecosystem works well, and works well for their stack. 

Jon Krohn: 00:36:11
And one of the wild things about this is how they called this so early. Now, okay, it seems obvious, and they have become this entrenched player in artificial intelligence, machine learning computing. But 10 or so years ago, when they would've made that call, and I think it was around the AlexNet moment that NVIDIA's CEO decided, "Let's go all in on this. This is going to be big," because AlexNet was using two GPUs, but that was a risk. We didn't necessarily know that deep learning was going to take off the way that it did, and the GPUs would be the obvious choice. And so yeah, remarkable that NVIDIA made that, what would've seemed like a big bet at the time, going all in on machine learning and AI, and getting involved up and down the tech stack across the ecosystem, like you're saying. And yeah, it's really paid off. 

Gaël Varoquaux: 00:37:07
There's something fundamental about GPUs, not as in graphics processing units, but as in a different structure than the CPU. And whether it's GPUs, TPUs, you name it, but it's clear that there is an edge going to more vector processing units. And indeed AI is pushing this forward because it's basically a lot of numerical algebra. But if you think about a lot of numerical algebra, it is useful to think about new hardware. People have known this for a long time, and the question has always been, "How do we make a compilation stack?" And in this sense, what has happened in deep learning, which is basically... Well, we worry about gradients, has been amazing, and this has empowered GPUs. The question is how can this carry over to other kinds of machine learning? It's harder.

Jon Krohn: 00:38:08
Yeah. So yeah, fascinating digression onto hardware there that I wasn't expecting. That was nowhere in my plan, but really exciting. So I'm back to development of the software, of scikit-learn. If we have a listener out there who hasn't contributed to scikit-learn before, but they have some idea about a new feature that they'd like to add or some change that they'd like to make, how should they get started? And in particular, my understanding is that good documentation is critical. 

Gaël Varoquaux: 00:38:38
Good documentation is critical. It's hard, it requires didactic skills, it requires good wording in English, it requires being able to say complex things in a simple way, and there are many aspects to it. It could be improving a layout somewhere, it could be improving the structure of a document, it could be improving an example. We try not to add examples, we have enough, or we try to refrain from adding too many examples. We clearly have a reviewing bottleneck, and I think this is something fundamental. We're seeing GitHub Copilot, we're seeing that it becomes easier to write code that looks okay. And the problem is knowing that it's okay, and that's actually harder. Reading code is harder than writing code. For documentation, I'm very happy to announce that a few months ago, we created a documentation team that can review a documentation pull request. So this has allowed us to lighten up our backlog of reviewing.

00:39:50
So for people who like explaining things and who want to contribute, which is absolutely crucial, there's actually a way forward, which is not only to submit pull requests, but it's much easier today to get in the documentation team than to get in the developer team. Much easier, and it requires a different skillset. So I see really an opportunity of growth for scikit-learn. And here it's happening, and I'm really excited about it.

Jon Krohn: 00:40:20
Nice. Yeah, so you're saying that there is this parallel opportunity, so not only can people be contributing code, so these people who are on the documentation team, they might not necessarily have a coding background. Maybe they could have a technical writing background.

Gaël Varoquaux: 00:40:36
Yeah, yeah. Of course, of course, of course. And coding backgrounds are overrated. I studied physics, we have someone in the team who studied biology. No, it's not about the background, it's about the affinity. I think people, they can be data scientists, you probably need to have some understanding from a user perspective, but the kind of technical skills that you need from a developer perspective are different. They're about software architecture, they're about numerical stability. And so now we have a way forward for people who don't have those super advanced numerical coding skills to progress in the project. And I think that's crucial because as the project has gotten bigger and more technical, there was a danger that we would become so technical that all of the people reviewing pull requests were really good at data structures, but not at wording in English. And so our goal is to create a spectrum of opportunities for different people with different skills. 

Jon Krohn: 00:41:49
Awesome. And then so for people out there who... Maybe me just suggesting this for the first time, is something that they should possibly be interested in, getting involved in open-source projects like a scikit-learn project. There's a core team that does this full-time... Actually, so that's an interesting way to go. So obviously there's a core team that does this full-time, and then there's also people who just... I don't know, thousands of people that contribute just voluntarily. So why should somebody think about contributing voluntarily to an open-source project like scikit-learn? So that's question number one. And then the question that follows from that naturally is then for somebody who maybe has already been doing that for a while, and would like to make the jump to being a full-time scikit-learn developer, how do you do that?

Gaël Varoquaux: 00:42:40
So why would you contribute to something like scikit-learn? Well, historically, most people were doing this, quote-unquote, to scratch an itch. There's something in the projects that you'd like to improve. That's the biggest motivation, it works really well, it gives the right kind of feedback loop. If you want to do this, it's really important that you first start to talk to other people, other users and other contributors, to know whether this aligns with the kind of vision that is shared across the project. We're currently trying not to grow too much because a project that grows too much basically collapses. We're focused on not always adding new estimators, but rather improving the ones we have, and worrying about the full pipeline inspection, maybe parallel computing, data ingestion, facilitating what we have often. And then there's the whole process of, well, discussing this online with an issue, contributing code. And then it's a social thing, for the better or the worse, but it's a social thing. There are many dozens of people interacting on this project, and the central people have a resource allocation problem.

00:44:07
They have too many requests for help and review. And so for the better or the worse, they're going to allocate their time where they think that there is going to be a positive feedback, a useful feedback for the project. And so what really happens, in terms of if you're really passionate about this and you want to get involved, I suggest figuring out what's useful for the core developers of the project, and helping there. And as you help there, you progressively understand the dynamics better, you progressively become more and more useful, and you're basically moving up the ladder of skills, but also, you become known to the project. Now it's a slow process. And if at some point, you want to become more full-time on this, typically by this point you have the network, you should be talking to us. It's a struggle from our side to balance the funding, balance all the different opportunities, because the core people in the project are working in very different settings. We have people working for startups or big companies in the US or universities all across the world.

00:45:26
We have funding that comes from partnerships, from government funding. As with all large operations, we're struggling with this, but if we have somebody who's talented and excited about the project, and we're able to identify this, we're always happy. And maybe one last thing, every once in a while, we organize sprints. And some of those sprints are onboarding sprints. And we're especially careful to organize onboarding sprints for people who are historically underrepresented in the community, and that's something important. We will invest more time here because we need to correct this imbalance. And that's just a great way of becoming involved in the project, and we try really hard to be very friendly and welcoming, and I think often we succeed. 

Jon Krohn: 00:46:22
Data science and machine learning jobs increasingly demand cloud skills. With over 30% of job postings, listing cloud skills as a requirement today, and that percentage set to continue growing. Thankfully, Kirill and Hadelin, who have taught machine learning to millions of students, have now launched CloudWolf to efficiently provide you with the essential cloud computing skills. With CloudWolf, commit just 30 minutes a day for 30 days, and you can obtain your official AWS certification badge. Secure your career's future, join now at CloudWolf.com/SDS for a whopping 30% membership discount. Again, that's CloudWolf.com/SDS to start your cloud journey today. 

00:47:04
Nice. Well, that is probably a reassuring thing to hear because there's probably listeners out there who come from a background that is underrepresented in data science. And so to hear that there's this deliberate support for these people and that it's a friendly thing, that it's not a... Because I think it can be an intimidating... I can't think of the right word, like an intimidating block, like an intimidating monolith to look at from a distance to see, "Oh, this GitHub repository." And to see all this activity and, "How do I get started?" So I guess for people in general, but then maybe for underrepresented groups in particular, you talked about getting started in a place where there is the most need. How do you identify that? How can a listener identify what is the most need?

Gaël Varoquaux: 00:48:00
That's super hard. I mean, we struggle ourselves. We have a big information management problem. We tag issues "Help needed." The problem is, the minute we tag an issue "Help needed," somebody wants to work on that issue, and the easy ones, the ones that are "Help needed" and easy, those get done immediately. I'm not being very helpful here. And to be honest, if you want to get started, maybe the easiest thing is not to get started in scikit-learn, but to get started in other, simpler projects that allow you to understand the kind of dynamics. The GitHub dynamics, the reviewing dynamics, and everything. I'm passionate about open-source, I was passionate about open-source. I realize it's hard to get started in scikit-learn these days. I don't want people to give up. I think people should be passionate about open-source, and it's a process. You get started with a little thing, and you understand things better. So I don't have a good answer to this.

Jon Krohn: 00:49:13
So then, I mean, this puts you on the spot and sometimes it's tough to come up with an answer like this immediately, but then if there is some other open-source project or projects that have a similar ecosystem today that a listener should maybe consider getting started on, do you have any particular recommendations?

Gaël Varoquaux: 00:49:32
Well, I'm terribly biased, of course. I'm going to pitch the project I'm excited about these days. It's about a smaller code base, it's about a smaller group of people. It's really hard when you have literally hundreds of people to remember who is who. Yeah, so recently we've started a project that we call Skrub, S-K-R-U-B, which is for data preparation before scikit-learn, typically DataFrames. I'm super excited about it. It hasn't even been released, but it's out there on the internet. It actually has code with a longer history, so it's not an empty shell. It has a lot in it. And here it's easier to get started. Now of course, I'm pitching my own project, I'm sure there are other projects. 

Jon Krohn: 00:50:22
No, but that's great. That's great. I mean, that makes a lot of sense. So Skrub, S-K-R-U-B, for pre-processing data before they go to scikit-learn or some other machine learning package for modeling or analysis. This sounds like a really cool project. I hadn't heard of it, which I guess is unsurprising, it hasn't even been released, yeah, but that's fantastic. So I'll be sure to include in the show notes. So you can head to SuperDataScience.com/737, and we'll have the link to Skrub, as well as of course, all of the other kinds of important links that we've talked about in today's show. Fantastic. All right, so that gives us a really good sense of where we are with scikit-learn today. We started off the episode by talking about how scikit-learn started, so now let's talk about the future. So there is a yearly KDnuggets poll, and in the most recent one, data scientists said that the dataset size that they work with the most is between one gigabyte and 100 gigabytes. And interestingly, that's the same as a decade ago. 

Gaël Varoquaux: 00:51:37
Yeah, it's not moving.

Jon Krohn: 00:51:40
Yeah. And so does this have any implications for the future development of scikit-learn, especially with data-centric AI? 

Gaël Varoquaux: 00:51:50
Well, so I've been looking at the same numbers, and this number is interesting because it's the number where you're just borderline happy on a laptop. It works, but you need to be careful not to copy the data a few times. So that's something I'd really like to have, and it's going to take a little while, is to be able to work better with almost out-of-core data. And so I mentioned that we're getting better at having things like CuPy arrays. Maybe one day we'll be better at having things like Dask arrays, or Polars' lazy arrays. But Polars' lazy arrays actually live in the Skrub world because... Well, they're not lazy arrays, they're lazy data frames. And so data frames live more naturally in the Skrub world. So once again, I'm thinking ecosystem. I'm thinking, "How can we make it easier for people who work with these kinds of datasets?" There are probably a few tables that take 100 megabytes or a few gigabytes. And if you look at the problems, a lot of this is data preparation, and so I really want to streamline data preparation. Now that's Skrub; going back to scikit-learn. 

00:53:15
Our efforts are on better model inspection, which isn't easy because we need to understand which methods are reliable, and convey the proper understanding to the user. Better model validation and evaluation. Here we're looking at better metrics, better choice of metrics. It's a lot of details. One thing I would really like to see, and I hope we're going to see this in a little while, is to be able to have a report on a model that reports many different things in terms of evaluating a model and auditing a model. Ideally with guidelines on... Well, for instance, you have a severely imbalanced dataset, you should probably not be looking at accuracy, so really helping the user. One thing I'd like to see, but the work hasn't started, is to more easily get inspection of a grid search or random search, for people to easily know what the important hyperparameters are... So some effort I would really like to see is really going into this usability. A feature that was merged just recently that I'm excited about, it's really a detail.
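
A hedged sketch of the kind of workflow being described: choosing a metric suited to an imbalanced dataset and then inspecting which hyperparameters actually mattered in a search. The dataset and parameter grid are arbitrary stand-ins; everything shown already exists in scikit-learn via GridSearchCV and its cv_results_ attribute.

```python
# Sketch: a balanced-accuracy grid search on imbalanced data, then
# inspecting the search results to see which hyperparameter mattered.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2_000, weights=[0.95, 0.05], random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="balanced_accuracy",  # plain accuracy looks good even for a useless model here
    cv=5,
)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)
print(results[["param_C", "mean_test_score", "std_test_score"]])
```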

00:54:38
So these days, if you do a representation, a repr, of an estimator, you have a small diagram, and we want to improve this. So there's a bit of engineering that goes on, but just recently, first, we have the color of the diagram that changes when the estimator has been fitted versus not fitted, because that's an easy mistake that people make a lot. We have a little question mark button that, when you go on it, sends you to the online webpage of the estimator, and I'd like to keep improving this. So here it's interesting because we're actually going into user experience. And in terms of opportunities for contributing, by the way, we don't have many people who are really good at UX. This requires front-end skills, CSS skills. It's hard because we can't rely on classic frameworks such as Bootstrap, because we need to be able to work in Jupyter Notebook, in VS Code. And so here, there's clearly, in my opinion, a lot to be done. 
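
For anyone who has not seen the diagram being discussed, a minimal sketch: in a Jupyter notebook or VS Code, displaying an estimator renders an interactive HTML box diagram of the pipeline. The explicit set_config call is only needed if a plain-text repr has been configured, and the rendering details (colors, the fitted/unfitted distinction) vary with the scikit-learn version.

```python
# Sketch: the HTML diagram repr of a pipeline in a notebook environment.
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

sklearn.set_config(display="diagram")   # diagram repr (the default in recent versions)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe   # in a notebook, this last expression renders as the boxed pipeline diagram
```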

Jon Krohn: 00:55:47
That's very cool. Yeah, user experience. 

Gaël Varoquaux: 00:55:49
I think those are the biggest things I'm excited about. I hope we're going to get there, maybe one day we're going to get basic AutoML, or maybe I'm overselling it when I say AutoML, but basically better code to select hyperparameters. We have a bunch of efforts that go in there. I'm really hoping that we're going to get what we call hyperparameter spaces, that's going to take a while, but the ability for models to suggest what the good ranges of hyperparameters are that should be explored, that should simplify people's lives massively. One recent thing that we had that's super important is that we have metadata routing. Now that's incredibly technical, but that's the kind of stuff that allows you to do a cross-validation that accounts for data that's non-IID, where you have natural groups in it. It's the kind of stuff that allows you to build fairness measures, to build causal measures. So it's super technical, I hope it's going to pan out. 
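
To ground the non-IID point, here is a hedged sketch of the classic way scikit-learn handles natural groups in cross-validation, which is the use case metadata routing is meant to generalise: GroupKFold keeps all samples from one group (say, one patient) on the same side of every split. The data here is synthetic and purely illustrative.

```python
# Sketch: group-aware cross-validation so correlated samples never leak
# across the train/test boundary.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = rng.integers(0, 2, size=300)
groups = rng.integers(0, 30, size=300)    # e.g. 30 patients, several samples each

scores = cross_val_score(
    RandomForestClassifier(), X, y,
    groups=groups, cv=GroupKFold(n_splits=5),
)
print(scores.mean())
```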

Jon Krohn: 00:56:55
Yeah, those are all fantastic exciting areas to be going into. Obviously, this kind of UX focus, super interesting because of the kinds of complexities you mentioned, like having to be able to work in Jupyter Notebooks, and yeah, AutoML would obviously be hugely useful. 

Gaël Varoquaux: 00:57:13
I forgot one thing we're working on really hard: better logging, and that's also really complicated. So the challenge we have in scikit-learn is we don't want to be a framework, we want to be a library. But things like this are better done in a framework. So that better logging is... We have Jérémie du Boisberranger working on better logging that's quite extensible, to enable people... So it will have trivial consequences, like progress bars on fitting pipelines, but it will also have more fundamental consequences, the ability to monitor details of the pipeline. I would hope that this enables us to do better MLOps, which is something we're looking at. 

Jon Krohn: 00:57:58
Very cool. For our listeners, how do you distinguish... you said this idea, you're like, "We want to make sure that scikit-learn is a library, not a framework." How do you distinguish those two terms?

Gaël Varoquaux: 00:58:09
A library is a set of objects or functions that are contained, that don't need setting up a central engine or manager, and it's really important because they're much easier to reuse. The goal is to be able to embed in whatever you have, and we have this tension. If you do MLOps, you probably want to have a database, maybe an orchestration engine, and so we need to make scikit-learn in a way that it plays well with these things. 

Jon Krohn: 00:58:45
Gotcha, gotcha, gotcha. Yeah, so the library is more standalone, whereas the framework, it's something that depends a lot on externalities.

Gaël Varoquaux: 00:58:53
Has services typically. 

Jon Krohn: 00:58:54
Yeah, yeah, yeah. Cool. Nice. So yeah, that's a really fascinating look into the future. Something else, you talked early in the episode about some people coming on early in the scikit-learn project who were specialists in sparse matrices. And so that is from the natural language world, where we need that a lot of the time, where you might have a matrix where every row is a word in the vocabulary of the natural language dataset, and every column is an individual position, a word in a whole document. So you end up having these matrices that are almost all zeros. Personally today, I work with natural language data a lot, I don't typically go to scikit-learn first. Is that something that... 

Gaël Varoquaux: 00:59:49
I think that's the right choice. I mean, I wouldn't do NLP with scikit-learn; I do research in NLP, and we don't use scikit-learn for it. The specific problem I'm interested in here is when you have mixed data: you have a table, and in there you might have open descriptions of jobs, you might have reviews, and then the question is how do you combine them? And here, I think this question is still open. We have early evidence that it may be useful to use complex models, language models, large language models, to vectorize your data. And here I hope that in the future, at the ecosystem level, at the scikit-learn level, we're going to make it easier for you to join scikit-learn with those tools, transformers for instance, because that's what people need. They have a dataset that has mixed entries, and how do they use it in the easiest way?
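One way such a mixed table can already be handled today, sketched below with invented column names and a tiny made-up dataset, is to route each kind of column through its own vectorizer inside a single pipeline; the text column could equally be vectorized by a language-model embedding instead of TF-IDF, which is the direction Gaël is pointing at.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "job_description": ["data engineer for hospital records",
                        "nurse in intensive care",
                        "software developer, open source",
                        "epidemiologist, public health"],
    "years_experience": [3, 10, 5, 8],
})
y = [0, 1, 0, 1]  # made-up target

preprocess = ColumnTransformer([
    ("text", TfidfVectorizer(), "job_description"),   # free text -> sparse features
    ("num", StandardScaler(), ["years_experience"]),  # numeric columns, scaled
])
model = make_pipeline(preprocess, LogisticRegression())
model.fit(df, y)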

Jon Krohn: 01:00:53
Nice. Great. Yeah, it sounds like you've got the right plan, and yeah, that interoperability is obviously the key. And yeah, scikit-learn, despite being a standalone library, is obviously so useful as a component of other workflows, like maybe some natural language processing workflows. A common thing in our data science R&D at our startup Nebula would be that we're using a transformer architecture to make some predictions, but then once we have our results, we're using scikit-learn to analyze them, and so it's key. 

Gaël Varoquaux: 01:01:29
And so, back to the GPU question: if you're doing NLP, at least at scale, you need GPUs, there's no way around this. If you're not doing NLP at scale, if you're doing tables with a bit of text, maybe you don't need GPUs, and actually your CPU is fine. So that's also where we sit. We don't want to force people to use really big tools that they don't need, but we also don't want to get in the way of people who do need to use those big tools. That's really the trade-off we're trying to navigate. 

Jon Krohn: 01:01:58
Yeah, it makes perfect sense. In your answer to this NLP question that I posed, you were talking about the research that you do, so let's transition to that. You're a research director of the Soda team, and I haven't been able to work out exactly what that is, whether it's an acronym at all.

Gaël Varoquaux: 01:02:18
It stands for social data, but it's more having fun with... 

Jon Krohn: 01:02:20
Oh, yeah, yeah, yeah. I see, I see. Social data, Soda. And yeah, so at Inria, which is this distinguished French national research institute for digital science and technology, you're a research director of this Soda team, the social data team. And yeah, so it's doing research at the intersection of machine learning, health, and social sciences. Can you elaborate on what you're doing at Soda and why it's so exciting? 

Gaël Varoquaux: 01:02:48
So Soda started a few years ago because I'm convinced that data can improve many aspects of society, but I'm also convinced these days that in many areas, it's not just going to happen by itself. You're going to have to invest in it. And I know the health applications reasonably well because I was working in brain imaging before, and I like the health space because... Well, if we improve health, we're clearly improving society. But also because we're talking to a very mature set of domain scientists: epidemiologists, who know that there is a process to go from insight from data to actual actions on health. And there are health informaticians, people in medical informatics, who really know the engineering inside hospitals. So here we're embedded in a specific application area where I think it's a good landscape for machine learning to come into play. But Soda isn't only about health; we have one of our researchers who is into educational data mining. And from a pure formal perspective, there are links between those things. And what we try to do is, well, first have a positive impact, and we realize it's hard.

01:04:12
And then second, as we realize it's hard, we try to address the problems that we find haven't been addressed well, and the thing is, this is super useful for many other applications, because many of the problems we find are present in many applications of data science. It's beyond prediction; it's going from prediction and machine learning to impact. 

Jon Krohn: 01:04:38
Very cool. Yeah, and it's amazing that you do that. I mean, for everything that you're doing in your life, I thank you. We all thank you for what you're doing. It's probably an impossible thing to react to live as I say these kinds of things on camera. But everything that you've been leading with the scikit-learn library, these 1.4 million downloads a day, even if a lot of that is bots, those bots are doing work at the behest of humans, probably mostly still today. And so the impact of the scikit-learn project has been, and will continue to be, instrumental for data scientists and all the applications that data scientists power for decades to come. And then on top of that, this Soda work, where you're explicitly doing data science that is making a positive impact on problems that aren't otherwise being addressed. I mean, yeah, it really is amazing work. Do you have maybe one or two specific examples of projects right now that are ticking both of those boxes?

Gaël Varoquaux: 01:05:43
Well, I mean, maybe I can go back a few years and talk about the best successes we've had, even though they haven't been published. A few years ago, COVID comes in, bang, pandemic, and the hospitals are starting to get overwhelmed. And so the local hospitals here basically put out a call. They say, "Hey, we have a central information system that has a lot of data on what's going on. It's a network of hospitals, 39 hospitals, so there is a lot of data; come and help us." This was really fun because I gave up everything I was doing, I abandoned my poor grad students. I was actually on sabbatical back then, so it was easier. And with a few other people, including other people from scikit-learn, we just hopped in and worked on their database. So we were working on a SMART database. We had never worked on a SMART database before, and so here the challenge was mostly data preparation. Once we got to scikit-learn, it was trivial. And the other challenge was understanding what we could do that would have an impact on those very short timescales. So that was really interesting. 

01:07:01
We ended up doing things like forecasting of beds, or information extraction... Back then it was done with very simple models, no big language models, information extraction to understand the comorbidities and the age groups. And then the most advanced things we did were causal inference, to try to figure out what kinds of treatments were beneficial. This was never published, but here I had the impression of having a very positive impact. Currently in this space, we're trying to work on the progression to diabetes, the early stages of diabetes, where there are ideas that we might be able to prevent complications. And the question here is to provide evidence... well, understanding and evidence on what's possible, because in health, you don't put automated black boxes in production. You want guidelines, basically. And what I would like to do, but haven't started doing, is work on what we call medico-economic problems. Basically it's resource allocation, and I think that's quite important, and that's where data science can help. 

Jon Krohn: 01:08:28
Yeah, that last one there sounds really interesting, and isn't something I think I've ever talked about on air before, but this is obviously huge. It's relatively common on this show, and I think in a lot of environments, to talk about specific ways that data science or machine learning are allowing some new medical capability, but these medical capabilities don't happen in a vacuum. We have different structures: in the US, it's mostly private, but in Europe, it's mostly public healthcare. And there are only so many people, only so many financial resources, to be diverted. So how can those human, financial, and structural resources be targeted most effectively to have the biggest impact, to extend lives the most, or to deliver the highest quality of life?

Gaël Varoquaux: 01:09:21
So every action we take has a cost, and it may not only be a financial cost. I'm going to take an example that's only weakly related to machine learning. Statistical analysis, using actually fairly simple models, has developed cardiovascular risk factors. And so the question is, okay, you know some people are more at risk than others. You also know that prescribing statins reduces the risk of a major cardiovascular accident. So then you might prescribe statins to all the people who are at risk. But statins give you muscle aches. It's not only their financial cost; they also have a detrimental impact on the individual. And so some countries have gone for the strategy of prescribing statins massively, and it hasn't worked, because there has been no compliance. And no compliance is a reasonable choice: people were suffering because of the statins, so they stopped taking them. So every action we take has a cost, and it's not only financial. It may be the doctor's time, it may be the comfort of the individual. And those are the difficult questions. 

Jon Krohn: 01:10:43
Absolutely. Yeah, really great examples there. And yeah, so maybe something for listeners to be thinking about: a place where you could be making an impact as well, these medico-economic problems. As a final topic area for us today, this is something that, when you and I were discussing what we would cover in the episode from all of the possibilities, you highlighted as something that might be of big interest to our technical audience, which many of our audience members are. And so this follows on from a NeurIPS paper of yours from last year, from 2022, and I'll be sure to include that paper in the show notes. But the idea is that there are gaps in knowledge, and of course that's true of anyone; we're never experts at everything. And data science projects are often complex: they involve lots of different stakeholders, lots of different data types, maybe multiple objectives. So how can we steer the whole discipline of machine learning towards being more informed around the appropriate metrics for evaluating the problem being solved? Statistical thresholds, validation methodologies, sample sizes, and so on. 

Gaël Varoquaux: 01:12:02
So the paper you're referring to is probably not the NeurIPS one, but I have a few papers in this direction. It's a very important question. I'm convinced that in many application settings, what's most important is not the model you use, but how you use it and how you evaluate it. And it's a difficult question. In a sense, computer scientists are not very comfortable with this, because it requires actually going outside of computer science, and understanding the matching between what you do and the benefit that you can bring. And typically, this is something that you can only do if you have an understanding of the landscape in which you're going to apply your methods. So typically, you either need to train yourself in this landscape or work with domain experts. There's not going to be a magic bullet. And then what you can do as a data scientist, as a machine learner, is understand your metrics really well, and this is what we've been trying to do.

01:13:06
We've been trying to write a bunch of papers that explore those trade-offs and explain them, so that you can match what the domain expert is telling you to something that you can measure. And here I would say there are two situations in data science. One is where you're literally embedded in the system, and you're connected both to the inflow of data and to the decision making. This happens in Netflix recommendations, it happens in online retail. When you do this, you can modify your pipeline and measure the response, so you can start measuring whether what you're doing is beneficial or not, and you have no dataset shift between the data you're working on and the actual target data. And then there is the other setting, which is the offline setting. Typically, that's where we are in healthcare, where you're not making automated decisions; currently, in most settings, you're not actually feeding back directly to the system. You have a high operational risk: if you mess up, you're not going to lose money, you're going to kill people. And so in this setting, you can't just try things out, you can't A/B test your process.

01:14:37
You might also very often have a distributional shift. You can get some of the data, but for reasons that might be complicated, you're missing some of the data, and maybe there's actually a big distortion between the two. Those are the places where, historically, machine learning has had less impact, because it's much harder. And in terms of R&D, I think there is a lot to be done there. It's very hard. It's around distributional shift, it's around understanding the mismatch between the error you have as a machine learner and the utility function, the expected utility, that you face when you apply this. I think it's fascinating. 

Jon Krohn: 01:15:21
Yeah, there's obviously a lot that could be covered here. There's so much opportunity for people to have more statistical rigor and thoughtfulness in the applications that they're getting involved in. And I think as we move towards a world where people want more and more AutoML tools to make decisions for them, there's a risk that you end up, say, drawing conclusions about some healthcare problem where you're not considering the false positives or false negatives correctly, and you end up with a model that looks really good from an AUC perspective, but you're missing a bunch of patients that should have been diagnosed for... 

Gaël Varoquaux: 01:16:06
So that's the interesting thing: you said "should have been diagnosed," and that's exactly the interesting thing. It may not be that what matters is the balance between the false positives and the false negatives; it may be that within the positives, some are more important than others, and that's the hard problem. Because the balance between the false positives and the false negatives, we can talk about this, but that's something we can tackle with the tools we have today. Now, if some of your cancer patients are actionable, meaning you can actually do something for those people, and some of them are not actionable, and you detect the non-actionable ones but you don't detect the actionable ones, you're not bringing health value. 
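As a toy illustration of that point, with labels and scores that are entirely made up, a model can look strong on an aggregate metric like AUC while missing precisely the cases where action is possible:

import numpy as np
from sklearn.metrics import roc_auc_score, recall_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
actionable = np.array([1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)  # positives we could actually treat
y_score = np.array([0.3, 0.4, 0.9, 0.8, 0.2, 0.1, 0.35, 0.15])
y_pred = y_score > 0.5

print("AUC:", roc_auc_score(y_true, y_score))            # high: ranking looks excellent overall
print("recall, all positives:", recall_score(y_true, y_pred))
print("recall, actionable positives:",
      recall_score(y_true[actionable], y_pred[actionable]))  # 0.0 -- the model misses every case we could act on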

Jon Krohn: 01:16:42
Right, right, right, right. Yeah, a really good way to summarize that point. I was wondering if you have any thoughts on... So you were talking about how there's no magic bullet to a successful application, and you need to either develop domain expertise or work with a domain expert. I wonder what you think about tools like GPT-4, particularly its advanced data analysis, for its capacity to... It can sometimes function as a stand-in, obviously with limitations, and it's obviously not as good as a real domain expert. But if you're at a loss for a real domain expert, both for statistical expertise and potentially for expertise in whatever application area the statistics are being applied to, it is wild how this tool is now available to provide, in some cases, maybe some kind of magic bulletiness. 

Gaël Varoquaux: 01:17:34
It's wild. ChatGPT is wild, but I think the genius of ChatGPT is to actually not solve the problem and to put the responsibility of solving the problem on the user. There's actually no product, or every product, and that's genius. In my opinion, ChatGPT is closer to a super Google search, one that also has problems because it makes things up, although it's improving, than to actual automated decisions or automated recommendations. So the question I have for you is: you're saying it's a proxy for a domain expert, but is it better than a Google search? It's faster, but is it better than a Google search? And if you want to go into a domain you don't know, and you think that you have a good machine learning tool that can improve this domain, then you're probably going to want to Google a lot. And then we're back to the same problem with ChatGPT as with Google: it's the unknown unknowns, it's what you had misunderstood about the problem. And in my experience, this is typically where we fail. 

Jon Krohn: 01:18:55
Yeah, and there are some things, like having the code compile and run, where you can say, "Okay, the code that was suggested to me by the tool definitely runs." But then you get into the situation that you were describing right at the onset of this episode, where you can end up with... You were specifically mentioning GitHub Copilot, which is also... It's interesting: at the time of us recording today, I have an episode that went out, episode number 730, with the COO of GitHub, and it's focused a lot on GitHub Copilot. Obviously a powerful tool, and it will generally create code that runs. And so this has created, or worsened, the bottleneck that you have at scikit-learn, where more code can be generated, but it's the reviewing that can be tricky. Interestingly, the GitHub COO is really bullish on the ways that tools like Copilot can be useful for collaboration in code review as well. But the point that I'm getting to is that code either runs or it doesn't, and this is good; it's unlike, say, some healthcare advice, where there's no equivalent of compiling and running the advice you get. 

01:20:14
So at least you know, "Okay, this advice works." With healthcare advice, you're completely at the mercy of the recommendation. So yes, some Google searching, or actually consulting an expert, or digging up resources in a library, God forbid, you need to be doing that for the application area, whatever domain you're applying things in. At least in data science or software development, the code runs or it doesn't, and so that gives us some kind of reassurance. But it also... This is where I'm finally tying back to your comment right at the beginning of the episode: just because it compiles doesn't mean it's the right approach. And when you see code and it looks like it runs well, as the code reviewer you might think, "Okay, good enough," when in fact more thoughtfulness could have gone in, and maybe it could have created a more efficient process, or one with more statistical rigor. 

Gaël Varoquaux: 01:21:17
So I think machine learning... Unfortunately, by the way, I love machine learning, but I think, once again, it works really well when the operational risk, the additional cost of error, is small. And one situation where it's really small is when you can easily check whether the output is correct or not. So if I generate code that I can very easily check is correct or not, good or not, then Copilot is amazing, and that's where I use it. But when I generate code where you need to worry about corner cases, about numerical stability, and things like this... 

Jon Krohn: 01:21:52
Yeah, yeah, really good explanation. A really good dichotomy there for those situations: obviously low risk, you should feel more comfortable using it; higher risk, less comfortable. All right, so, a fascinating episode, Gaël. Thank you so much for taking the time. Before I let my guests go, I always ask for a book recommendation. Do you have anything for us? 

Gaël Varoquaux: 01:22:17
I'm currently reading a book that I'm enjoying a lot. It's not about tech at all; it's a book by David Graeber called The Dawn of Everything. It's by David Graeber and another person. David Graeber is, was, an anthropologist, and the other author works on ancient history. And it really revisits the evidence about social structures through history, and importantly outside of the history we typically know, outside of the history for which we have written traces: First Nations in North America, pre-agricultural societies in the Middle East. And what I like about this book is that it forces us to think about how groups of individuals organize in terms of relations of power, and ownership, and basic economy. And I'm not exactly sure where they're trying to go. I think where they're trying to go is to say that nothing is ever a given, and we're looking at things through the prism of a society we know, but things are not obvious. 

Jon Krohn: 01:23:45
Yeah, it sounds like a fascinating recommendation. It reminds me a bit of my favorite book, which is probably a bit of a cliche for a favorite book, Yuval Noah Harari's Sapiens. So I haven't read the book you're suggesting, but Sapiens does this great job of forcing you to reflect on how we live in a particular time where so many of the assumptions that we make seem obvious. Humanism is the big point that he makes: humanism is like the religion of our time, the secular religion of our time. Nothing is more important than human life, and that seems like a fundamental, obvious law, but it isn't, and it hasn't always been that way, and it is not necessarily going to be that way in the future. 

Gaël Varoquaux: 01:24:38
So David Graeber looks at slavery. It's interesting to think about... There were societies with slaves; what does that teach us about ownership, about relationships between people? What are the economics of slavery? 

Jon Krohn: 01:24:57
Yeah. So it sounds like a fascinating read, and it would give us... Yeah, it would be an eye-opening perspective on how different cultures can be, and maybe provide us with some ideas of how we can be improving things in years to come with things like open-source tools and computing. 

Gaël Varoquaux: 01:25:18
Well, that's where it actually segues into open source, because one of the themes of the book is that ownership is a complex notion, and that's actually true. And ownership, and organizing people to survive together, hey, that's what we do in open source. 

Jon Krohn: 01:25:36
Right. Yeah, glad to tie it all together. Beautiful. All right, Gaël, so I'm sure lots of our listeners absolutely loved hearing from you today. How can they follow you after this episode to continue to get your thoughts?

Gaël Varoquaux: 01:25:49
So historically, I was very active on Twitter. I don't really love the way it's going, so while I still use it a lot, I've been trying to use other platforms. I'm on Bluesky, on Mastodon, and I'm using LinkedIn a bit more. I don't know where I'm going here. And I think we have a complex equation, because the diffusion of information is very important, but when it gets biased in ways that are controlled by how appealing the information is, rather than how right the information is, that's a danger for society. And I don't have an answer to this, but it's an important question. 

Jon Krohn: 01:26:35
For sure, yeah. I mean, for me, this social media feed moment from more than 10 years ago now, I think starting with Facebook, of prioritizing not by recency but by how likely a post is to engage you... I guess we could go on for a whole other two hours about the issues that that has caused in our society today. But yeah, I'm a big fan of... I wish I could just see all the posts on my social media feed by recency; that would be awesome for me. And actually, I recently heard that in Europe, that could soon be an option that is required by law. 

Gaël Varoquaux: 01:27:12
It's complicated, because on Mastodon and Bluesky there's more of this, and I find I'm less engaged. So back to my own problems. 

Jon Krohn: 01:27:23
Yeah, exactly. 

Gaël Varoquaux: 01:27:24
So I don't know the answer here. 

Jon Krohn: 01:27:25
Nice. All right, well, thanks again, Gaël, for taking the time. It's been an amazing episode. And yeah, maybe in a few years we'll catch up with you again and see how the scikit-learn project is coming along. 

Gaël Varoquaux: 01:27:35
Thank you very much, Jon, for this amazing time. 

Jon Krohn: 01:27:43
Wow, what an incredible experience to meet Gaël in Paris at a distinguished institution like the Sorbonne and have that inspiring conversation with him; I hope you enjoyed it as much as I did. In today's episode, Gaël filled us in on how scikit-learn began as a fork from the SciPy project in order to make a memory-efficient implementation of support vector machines available in Python. He talked about how the collaborative, diverse team developing scikit-learn has led the package to become an executable machine learning textbook that covers the full range of ML approaches; how CuPy integration, now being added to the project, supports 10x speedups of scikit-learn operations on GPUs; and how you yourself can get started contributing to open-source software projects with the new skrub, S-K-R-U-B, project for preparing data for downstream ML modeling, which is a particularly promising place for you to get started. 

01:28:32
He talked about how his Soda Lab has made big societal impacts with data, including during the COVID pandemic, for diabetes and on medico-economic problems. He also talked about how there is no magic bullet for the successful application of statistics or ML to real world problems, and so you need to develop domain expertise yourself or collaborate with someone who does. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Gaël's social media profiles, as well as my own at superdatascience.com/737. Thanks to my colleagues at Nebula for supporting me while I create content like this Super Data Science episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the Super Data Science team for producing another fantastic episode for us today.

01:29:20
For enabling that super team to create this free podcast for you, we're deeply, deeply grateful to our sponsors. You can support this show by checking out our sponsors' links, which are in the show notes. And if you yourself are interested in sponsoring an episode, you can get all the details on how by making your way to jonkrohn.com/podcast. Otherwise, please share, review, subscribe, and all those good things. But most importantly, just keep on tuning in. I'm so grateful to have you listening, and I hope I can continue to make episodes you love for years and years to come. Until next time, my friend, keep on rocking out there, and I'm looking forward to enjoying another round of the Super Data Science Podcast with you very soon. 
