80 minutes
SDS 619: Tools for Deploying Data Models into Production
Subscribe on Website, Apple Podcasts, Spotify, Stitcher Radio or TuneIn
Erik Bernhardsson invented Spotify’s original music recommendation platform, built Modal Labs to help engineers build and scale tools, writes algorithms that play chess and design fonts—and he’s just getting started. In this episode, Jon Krohn and his guest address the different ways to interview a candidate, how to deploy a data model into the cloud, and the approach he took that made Spotify go from digital music startup to AI-driven streaming giant.
Thanks to our Sponsors:
Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.
About Erik Bernhardsson
Erik Bernhardsson is the founder of Modal Labs which is building better infrastructure and tools for data teams. Before Modal, he was the CTO at Better, and built the music recommendation system at Spotify. He has been programming since the early 90s, and has spent most of his career working with data.
Overview
From boxing to breathing techniques, people have developed all kinds of personal methods to work through their frustrations. Erik Bernhardsson's is particularly productive: when a tool frustrates him, he writes the code for a better one.
This was how Erik's business Modal Labs came into being. He wanted to get rid of the infrastructure that obstructed him and his team from building and scaling new tools, and he succeeded. Erik's ambition is for Modal Labs to take teams beyond the typical confines of infrastructure and help them ship data science solutions directly into the cloud. And he believes that Modal Labs marks the beginning of a new era for solutions that will serve and support engineers.
Given Erik's decades of industry experience, Jon also quizzes him on how he feels about the interview process for data scientists. Several approaches are available to interviewers, and Erik and Jon discuss the benefits and pitfalls of techniques such as whiteboard questions and code readings. The conversation then naturally moves to what Erik himself looks for in candidates. Erik explains the differences between what he looks for in an individual contributor versus a manager, but he also notes that all data scientists should exhibit the drive to investigate what the data are telling them.
This drive to “uncover the truth” led him to build the enormously successful music recommender system for Spotify. Until Erik joined, the fledgling Swedish company allowed users to search for their favorite artists or albums but not to find similar songs or groups. Erik transformed this basic functionality by building a far more rigorous algorithm, first using the number of times that someone listened to a song as a guide for how much they enjoyed it and then turning to unsupervised models to populate a matrix of users, songs, albums and artists. With this matrix, Erik's team built a vector database that could identify similar tracks by spatial proximity. Vectors are a significant research interest for Erik and Jon in their respective businesses, and they discuss how vectors could easily be the future of data science development. Erik adds that quantifying uncertainty is another challenging yet lucrative area and one that is, therefore, also ripe for future study. Finally, he teases us with his current investigations into the future of the cloud and when we might expect to see his novel thesis on cloud adoption appear.
In this episode you will learn:
- The data problem that Erik’s company Modal Labs solves [04:32]
- Erik’s prolific blogging career [09:15]
- Opportunities for making data teams more efficient and productive [14:42]
- Erik’s views on interviewing data scientists and software developers [20:18]
- Erik’s tips and tricks for data science interviewees [31:35]
- How Erik built Spotify’s original music recommendation system [38:58]
- Applying vectors to other tools, and opportunities for working with vectors [47:45]
- Using Annoy to search across vectors [50:57]
- Building Python module Luigi for Spotify [55:20]
- The tools that Erik loves to work with [1:06:23]
Items mentioned in this podcast:
- Datalore - Use the code SUPERDS for a free month of Datalore Pro, and the code SUPERDS5 for a 5% discount on the Datalore Enterprise plan.
- Zencastr - Use the special link zen.ai/sds and use sds to save 30% off your first three months of Zencastr professional. #madeonzencastr
- Bunch
- Modal Labs
- SDS 585: PyMC for Bayesian Statistics in Python
- SDS 601: Venture Capital for Data Science
- SDS 616: The Four Requirements for Expertise (beyond the ‘10,000 hours’)
- Annoy
- Luigi
- Pulumi
- Analyze how a Git repo grows over time
- Giving More Tools to Software Engineers: The reorganization of the factory
- How to Hire Smarter than the Market: A toy model
- The Half-Life of Code & the Ship of Theseus
- Generate fonts using deep learning
- Analyzing 50k fonts using deep neural networks
- Deep learning for... chess
- Alfredo Canziani YouTube Channel
- Stanford marshmallow experiment
- Dunning-Kruger effect
- Hard Landing by Thomas Petzinger Jr.
- The Smartest Guys in the Room by Bethany McLean and Peter Elkind
- Losing the Signal by Jacquie McNish and Sean Silcoff
- Lights Out by Thomas Gryta and Ted Mann
- Open Data Science Conference (ODSC) West
- Prof. Dawn Song
Follow Erik:
Podcast Transcript
Jon: 00:00:00
This is episode number 619 with Erik Bernhardsson, founder and CEO of Modal Labs. Today's episode is brought to you by Datalore, the collaborative data science platform; by Zencastr, the easiest way to make high quality podcasts; and by Bunch, the AI driven leadership coach.
00:00:23
Welcome to the SuperDataScience Podcast, the most listened to podcast in the data science industry. Each week we bring you inspiring people and ideas to help you build a successful career in data science. I'm your host, Jon Krohn. Thanks for joining me today. Now, let's make the complex simple.
00:00:54
Welcome back to the SuperDataScience Podcast. We've got an exceptional episode for you today with the iconic Erik Bernhardsson. Erik is the founder and CEO of Modal Labs, a startup building innovative tools and infrastructure for data teams. Previously, he was CTO of the real estate startup Better.com, where he grew the engineering team from the size of one, himself, to over 300 people. He was also previously an engineer and manager at Spotify, where he created their now ubiquitous music recommendation algorithm. He's a prolific open sourcer, having created the popular Luigi and Annoy libraries amongst several others. He's an industry-leading blogger with posts that frequently feature on the front page of Hacker News.
00:01:40
Today's episode gets deep into the weeds at points, so it will be particularly appealing to practicing data scientists, machine learning engineers, and the like, but much of the fascinating wide ranging conversation in this episode will appeal to any curious listener. In this episode, Erik details how the Spotify music recommender he built works so well at scale. He'll also talk about the litany of new data science and engineering tools he's excited about and things you should be excited about too, what open source library he would develop next, why he founded his startup Modal and what their tools will empower data teams to be able to do, and having interviewed more than 2,000 candidates for engineering roles, his top tips both for succeeding as an interviewer and as an interviewee. All right, you ready for this awesome episode? Let's go.
00:02:33
Erik, welcome to the SuperDataScience Podcast.
Erik: 00:02:35
Thank you.
Jon: 00:02:36
It's awesome to have you here and in my apartment, it makes filming so much fun for me. I understand you had a nice easy walk over here.
Erik: 00:02:43
I did. My office is two blocks away from here and I also live five blocks from here, so it's very convenient for me.
Jon: 00:02:49
Super convenient. Well, we set all this up for you. I moved into this apartment originally with this in mind years later.
Erik: 00:02:55
Is that why it's so clean in here?
Jon: 00:02:59
Exactly, it's really, it's a prop.
Erik: 00:03:00
It feels like you just moved in. It's so clean.
Jon: 00:03:05
I know you through Sarah Catanzaro.
Erik: 00:03:08
Yeah, she's great.
Jon: 00:03:09
She is really great. She did episode number 601 of the podcast. If listeners are looking for an episode on venture capital in general, or particularly if you're interested in venture capital applied to data science companies, incredible episode, she was so generous with her time. It blows my mind that somebody who is a general partner at a major venture capital firm makes that kind of time to create such an amazing interview for the audience.
Erik: 00:03:36
I think Sarah plays a long game. She has a fantastic network. I took money from Sarah, so I know her quite well.
Jon: 00:03:44
Does she know that?
Erik: 00:03:47
No, I haven't told... But I really think that she stands out in terms of having a huge network and always being willing to connect and intro people to people.
Jon: 00:03:58
You are the third in a trio of incredible guests that I recently had on the program all recommended by her. We had Emre Kiciman in episode 613, Sean Taylor in episode 617, and then now Erik the final of three for now from Sarah. I'm sure she might have more. I'll try to squeeze her.
Erik: 00:04:19
I'm sure.
Jon: 00:04:21
Absolutely fantastic guests. You are currently the CEO of Modal. It's a new startup. And so, what pressing data issue does your company Modal solve?
Erik: 00:04:35
I mean, I like to say, purely selfishly, I built Modal to solve a bunch of stuff I always wanted, having worked with data for, I don't know how many years, 15 years maybe. Modal is a way to basically get rid of a lot of traditional infrastructure that I find gets in the way of me actually building things and scaling them out and productionizing things and scheduling things.
00:05:01
The idea is it's like serverless data execution, in the same vein as maybe Lambda or Vercel, maybe what Vercel is doing for front-end teams. Actually, fun fact, I met Guillermo last night, the CEO of Vercel. In a similar way, I feel like there's an opportunity to build something for data teams where you don't have to think about infrastructure, where you can just write code and ship it into the cloud and it just works. That's sort of the aspiration of what I've been working on.
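For a flavor of what "just write code and ship it into the cloud" looks like in practice, here is a minimal sketch using Modal's Python client. The exact decorator and class names have shifted across versions of the library, so treat the specifics as illustrative rather than authoritative:

```python
# Minimal sketch of Modal-style serverless execution (names reflect one
# version of the Modal client library and may differ in current releases).
import modal

app = modal.App("example-app")  # earlier versions called this modal.Stub

@app.function()
def square(x: int) -> int:
    # This function's code runs in Modal's cloud, not on the local machine.
    return x * x

@app.local_entrypoint()
def main():
    # .remote() executes in the cloud; no Dockerfiles or Kubernetes involved.
    print(square.remote(42))
```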
Jon: 00:05:28
That sounds really exciting. I think we're going to get into later in the episode talking about some of your open source projects. This sounds like a theme throughout your career. You identify major problems, things that you would like to have, tools you'd like to have, and then you invent them.
Erik: 00:05:42
I feel like I do it almost out of spite sometimes. I'm like, "This is annoying. Why isn't there a better tool?" Then finally, I give in. I'm like, "I guess I'll have to do this then." I think that's been the common theme throughout my career: I never wanted to build any of this, but I felt like I had to because there was a gap.
Jon: 00:06:05
Nice. It sounds like you're onto a gap here with Modal being able to, as you say, just ship data science solutions into the cloud relatively seamlessly. That is a holy grail.
Erik: 00:06:17
I think so too. I think so too.
Jon: 00:06:22
Besides data engineers, who seem like an obvious candidate for your target market, who else is in your target market for this tool?
Erik: 00:06:29
I mean, I think it's actually very general in the sense that we're building something that arguably aims to replace Kubernetes, which is a big audacious goal. In that sense, anyone who uses Kubernetes really, but in particular, we tend to focus right now on the types of problems that data teams have. And so, it's primarily data people. I don't know, there's like 500 titles today, but data engineers, data scientists, machine learning engineers, maybe analytics engineers, maybe, I don't know, whatever new role there is today.
00:07:02
But it's a fairly general product in the sense that it's for the stuff that you need to write code for, and I think there's always going to be stuff like that. I mean, there's amazing stuff you can do with SQL today. I'm a big fan of data warehouses in general, but there's always stuff you're going to have to write code for. Today, when you write code, it's hard to productionize it and scale it up, and you have to deal with Docker and Kubernetes and, I don't know, all this other stuff, Helm charts and Terraform and AWS. What if there's a way where you can just work with the cloud where it feels like it's just your local computer?
Jon: 00:07:36
Kubernetes is designed to work really well with Docker containers, which you just mentioned. Does Modal work with Docker containers or it allows you to circumnavigate them entirely?
Erik: 00:07:46
You can use Docker containers with Modal, but we don't use Docker. I mean, Docker containers are just the same thing as OCI containers, which is the underlying file format. Modal essentially uses the same idea of containers, that format of: here's a root file system. What is a container? A container is like a root file system. It's the slash directory on Linux, the whole root file system. In the same way, we basically let you package those things and execute them in the cloud. Then there's the runtime side of containers, which is a bunch of chroot and seccomp and a bunch of other stuff. For a lot of that, people use virtual machines too. That's the other side of it.
00:08:28
We've had to build all of these things. We've had to build our own file system. We've had to build our own container runtime. We had to build our own container builder, which actually uses the Dockerfile syntax, because I think Docker, for what it's worth, has become a little bit of a standard file format, or whatever you want to call it, an interchange format for containers.
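The "root file system" half of Erik's definition can be shown in a couple of lines of Python; everything a real runtime adds on top (namespaces, cgroups, seccomp filters) is omitted here. A toy illustration only, requiring root privileges on Linux, and the path is a hypothetical placeholder:

```python
# A container's core idea in miniature: make a directory the root file
# system. Real runtimes layer namespaces, cgroups and seccomp on top.
import os

os.chroot("/path/to/extracted/rootfs")  # hypothetical path to an unpacked image
os.chdir("/")                           # from here, the process sees only
                                        # that directory tree as "/"
```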
Jon: 00:08:50
We talk about it all the time on the show. Anybody who's come on and talk about MLOps, Docker is one of those tools that they're like, you got to know Docker, you got to know Kubernetes, at least for now, but it sounds like maybe in the future, people will be talking about Modal instead.
Erik: 00:09:03
My life goal is no one should have to write YAML or Dockerfiles in the future.
Jon: 00:09:10
Nice. In addition to your company, you are renowned for your blog posts. I've known about you personally for many years because of your blog posts. Very frequently, they make it to the top of Hacker News. And so, when Sarah mentioned that she could introduce me to Erik Bernhardsson, I was like, "Please, immediately." I got you on the show as quickly as I could. An example of a great blog post of yours is one on industrial bottlenecks and why software engineers need better tools. We'll have a link to that blog post in the show notes. Can you elaborate on this particular blog post for us?
Erik: 00:09:49
Yeah, I'm actually very glad you brought that blog post up because it's one of the blog posts that I feel like I thought the most about but actually never made it to Hacker News, so I'm glad it's-
Jon: 00:10:01
An underground pick here on SuperDataScience Podcast.
Erik: 00:10:02
Yeah, you picked the underdog. I think the thesis of that one, and there's actually multiple theses in that blog post, but if you look at engineers, there's this, I don't want to call it a paradox, but as engineers get more productive and as engineers get better tools, the demand for them actually goes up, because suddenly there's a whole new set of use cases for engineers. I mentioned a lot of examples in the blog post, but 20 years ago, it wasn't worth building a customized website management system for dentists. But now, I think there's a bunch of startups doing that, because the cost of building a startup to serve the dentist market is low enough that you can actually do that.
00:10:52
I think a lot of it comes back down to better tools. Engineers have better tools today. The other thing that I think is interesting, the flip side of that, is if I look back historically, demand for engineers has often been expressed in terms of demand for no-code tools. Back in the early 2000s, for instance, everyone was raving about Dreamweaver or whatever, it's maybe not the best analogy, but people were saying, "Is this the end of HTML? Are people even going to have to write HTML in the future?" We're going to have all these WYSIWYG tools and people will just build websites and they're not going to need engineers for that anymore.
00:11:36
As it turns out, those tools still exist. I mean, there's, what's it called, Squarespace and Webflow and a bunch of those tools, but if you look at the number of front-end engineers, I bet it's like 100X more in the world today than it was 20 years ago. Clearly, there was a huge need for engineers. And so, I think the same thing could be said of many other tools. There's always this tension where there's a lot of demand for engineers and then people try to invent tools to go around engineers. Then, they point to those as saying engineers are no longer going to be needed. But in fact, what generally tends to happen is you give better tools to engineers, they become more productive, and the demand for engineers actually goes up. I don't know. Give it 10, 20 years, and I think we're going to have 10X more engineers doing 10X more stuff at 10X less the cost due to better tools.
Jon: 00:12:26
I agree 100%. And so, then, the connection that I would like to try to draw here is between Modal and that blog post with the idea that a tool like Modal allows data scientists, for example, to be able to push things to production without necessarily having all of the engineers that you might need today or you might have needed a few years ago to get that into production and be performant.
Erik: 00:12:49
Yeah. And so, I'm glad you made that connection because I think it's a really profound one. I was a CTO for many years, so I managed a lot of different teams, and now I think it's fair to say, and I hope no one takes offense, that data teams are just not quite where other teams are in terms of productivity. If I look at the tools they're using and the workflows they're using versus front-end engineers and backend engineers, I see a more mature toolchain. It's less fragmented. Across companies, people generally tend to converge, and there is an emergence of certain standards on the data team. I mean, I think dbt is a good example. There's more of a standardization. Airflow was maybe an example 10 years ago.
00:13:30
But I still think that data teams struggle a lot with infrastructure. They struggle a lot with the tools they're using, whether it's tens of thousands of SQL queries that are each 100 lines long and how to maintain that. Another manifestation of the same thing is that when companies get big enough, a lot of them end up building their own internal data platforms, which to me is a more direct waste, or lack of productivity. If you think about how many thousands of companies are each building their own data platforms, that is a lot of wasted effort. Someone should just build it for them and sell it to them. That generally tends to be what tools do for you. Instead of all these companies building half-baked tools, what if centralized infrastructure was built once and sold back to them?
00:14:19
I think on the other hand though, you mentioned that in the future, we may not need as many data engineers. I actually think, going back to what I said about industrial changes, making data teams more productive may in fact drive up the demand for data engineers.
Jon: 00:14:37
Just not on that specific problem. It frees them up from... The data scientist who no longer needs to be yoked to the data engineer to get their solution into production, that data scientist can be making data tools that some other data engineers can use, I think is your point, and so it feeds into this ecosystem.
Erik: 00:14:54
Exactly, and maybe the business teams need it. I'm sure if you went to the average CFO at the average company and told them, "Here's 10 data engineers. What can they build for you? Let's automate your entire financial systems," he or she would be amazed. That's just one example. I'm sure if you go around just looking at different aspects of the average e-commerce site or web product or whatever, there's just so much stuff where you could incorporate more machine learning, personalization, relevance. This goes for engineering as a whole, but I think especially for data teams and data engineers and data scientists, there's an incredible latent demand where if you lower the cost, suddenly we can have so much more data being leveraged throughout products and companies.
Jon: 00:15:49
Today's show is brought to you by Datalore, the collaborative data science platform by JetBrains. Datalore brings together three big pieces of functionality. First, it offers data science with a first-class Jupyter Notebook coding experience in all the key data science languages: Python, SQL, R and Scala. Second, Datalore provides modern business intelligence with interactive data apps and easy ways to share your BI insights with stakeholders. And third, Datalore facilitates team productivity with live collaboration on notebooks and powerful no-code automations. To boot, with Datalore you can do all this online, in your private cloud, or even on-prem. Register at datalore.online/sds and use the code SUPERDS for a free month of Datalore Pro and the code SUPERDS5 for a 5% discount on the Datalore enterprise plan.
00:16:40
We have these winds at our back, several of them. One of them is more abundant data tools, cheaper data tools, and then in addition, we have cheaper data storage, way more sensors collecting way more diverse types of data, and then compute being cheaper as well. With all of these tailwinds, I mean yeah, I agree with you totally that in 10 years, I think we'll have 10 times as many people working on software problems related to data in some way.
Erik: 00:17:11
Totally. When I think about tools, a big part of why I started Modal wasn't necessarily that I wanted to build this low-level container runtime. I started with the question of looking at data teams. It seemed like they could be doing things more productively, they could be more efficient. And so, looking at the different aspects of how they work and how they operate, I ended up realizing I wanted to start at the lowest level. But I think it's always been with that end goal: how do I make data teams more productive? I think that's in general how to think about tools. Tools are all about making engineers or other people more productive.
Jon: 00:17:51
This might be a dumb question, maybe I should have done research into this before I asked it. What does the Modal name mean? Where did that come from?
Erik: 00:18:02
The long story, the real story, is that originally, I started building a tool that looked like multiprocessing, if you're familiar with multiprocessing, and the idea was I wanted to build multiprocessing but, instead of processes, it actually moves the computation to the cloud while preserving the same API. For various reasons, I started calling it Polyester, because my thinking was: polyester is a thread, it's fabrics, that analogy, and then poly, multi. Then, at some point, I realized I don't like the term polyester. It sounds weird.
00:18:34
At some point I just Googled types of fabric, and modal turns out to be a type of fabric. We picked the name because of that. It also sounds good. I don't know. Modal also means 35 other things. There's modal jazz. There's modal logic. There's the mode of a statistical distribution. There's multi-modal transport. There's all these other things, which I like. I feel like that's a general 2010s, 2020s trend: companies just pick a weird noun that doesn't really mean anything but means a lot of things.
Jon: 00:19:08
I mean, so you have now explained it is related to this idea of fabric, and so there is something clever there. I thought it might be something obvious that would be embarrassing to me, some computer science thing where you're like, "Oh, you don't know Modal? Idiot."
Erik: 00:19:18
No, no, no, no, no, no, no. The funny thing is I've been meaning to create custom T-shirts for my team. Every time I want to do it, I always get hung up on: no, it has to be modal fabric. But I can't find any manufacturers that actually do custom T-shirts made of modal.
Jon: 00:19:37
What does it mean? What is it-
Erik: 00:19:37
It's some sort of fabric made of beech, something like beech trees.
Jon: 00:19:41
Oh, beech trees. Oh wow.
Erik: 00:19:42
I think maybe something like that, supposed to be very smooth.
Jon: 00:19:47
Cool. Well, we'll have to get our hands on some. Before Modal got you into this world of fabric, you were CTO of another startup for six years, and you grew the technology team there at Better from 1 to 300 people, which is crazy. You interviewed over 2,000 people. It's a lot of people.
Erik: 00:20:08
A lot of people.
Jon: 00:20:09
Our listeners often want to know what kind of interview tricks are out there, what they should be looking for. What did you learn from so many interviews and what could our audience learn from that? Are there low-hanging fruits that aspiring engineers, data professionals could focus on when they're heading into an interview?
Erik: 00:20:32
I think those are maybe two separate questions. One is how to evaluate candidates from an interviewer's point of view. The other one is how do you impress the interviewer? I think if there's anything I've learned from interviewing thousands of people, it's that interviewing is just incredibly hard. Having hired a lot of people that I've interviewed and then thinking back on who I thought would be good versus who was actually good, I realized it's a very hard prediction problem. That, I think, is at the core of what interviewing is about. You get a chance to meet someone for four hours or whatever, and you're trying to make a prediction: is this person going to perform in my organization? That's what it comes down to. That turns out to be a very hard, low signal-to-noise problem. You meet someone for four hours and you have to make a very hard decision.
00:21:26
If there's anything I learned, it's there is no trick really. You just have to collect a lot of signals and try to combine that in some way. It's funny because some people are like, "Well, you shouldn't do this, you shouldn't do that." People have very strong feelings about how interviews should be conducted. "You shouldn't do whiteboard interviews." "You shouldn't do whatever." "You shouldn't do take home or you should do take home." I don't know. To me, it boils down to I don't care. If someone told me that whether someone can spot a card trick or if someone could... It doesn't matter. If someone told me there is someone-
Jon: 00:22:00
If someone could spot a card trick.
Erik: 00:22:02
... some absurd thing that has super high correlation with their future ability to do well in my organization, I think that's a good signal. That's what it comes down to. Now, talking about signals, I think you can reverse engineer it a little bit and think about what are the things that generate the most signal per unit of time, because I think that's a good heuristic for how to design interview processes. I tend to think that, for instance, whiteboard questions are bad because at the end of the day, you spend an hour on a whiteboard question and you get one bit of information: did they solve this thing or not? Maybe you get a little bit more, but it's not time-efficient versus how much information you get.
00:22:47
I tend to, for instance, favor code reading instead. The reason is that if I write down 10 or 20 code samples and go through them with the candidate: can you spot the bug? How does this work? Can you explain how to optimize this code? Things like that. You can actually cover a lot of terrain very quickly. That is at least how I think about ways you can get more signal per unit of time out of an interview, but then it obviously also comes down to what role you're hiring for and things like that.
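As a concrete illustration of the format Erik describes, a bug-spotting prompt might look something like this (a toy example of ours, not one from his actual question bank):

```python
# A toy code-reading prompt. Task for the candidate: this function should
# return the n most-played tracks. Can you spot the bug? How would you
# make it more efficient for large inputs?

def top_tracks(play_counts: dict[str, int], n: int) -> list[str]:
    # Bug: sorted() is ascending by default, so this actually returns the
    # n *least*-played tracks; it needs reverse=True.
    ranked = sorted(play_counts, key=lambda t: play_counts[t])
    return ranked[:n]

# Discussion points: fix with reverse=True; for large inputs,
# heapq.nlargest(n, play_counts, key=play_counts.get) avoids a full sort.
```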
Jon: 00:23:23
That is interesting. For data science interviews on my team, we do use a whiteboard a lot, but it is for solving a lot of different problems. We have the candidate work through a lot of different phases of data science model development. I'll present a problem and say, what kinds of data are out there? How would you start on this problem? Then you can keep going down the line: okay then, how would you create a model with this? What kind of model would you use? I find that, at least for data science interviews, with a whiteboard over an hour I can cover a data science model from conception through to production and dig deep on parts that I find interesting as an interviewer. I might even learn some stuff.
Erik: 00:24:13
I think that's very smart for two reasons. One is that you're covering a lot of different things, so it decorrelates your signals. Going back to how to think about signals, you want to find a lot of signals that are independent and then combine them; that way you reduce the variance. Whereas if you go all in on one skill, you may get more variance.
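Erik's point here is standard statistics: the mean of n independent, equally noisy estimates has its variance cut by a factor of n, while highly correlated estimates gain you almost nothing. A quick simulation of both cases (our illustration, with made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
true_skill, noise_sd, n_signals, n_candidates = 1.0, 1.0, 4, 100_000

# Independent interview signals: variance of the mean shrinks ~ 1/n.
indep = true_skill + noise_sd * rng.standard_normal((n_candidates, n_signals))
print(indep.mean(axis=1).var())  # ~ noise_sd**2 / n_signals = 0.25

# Perfectly correlated signals (same noise repeated): averaging gains nothing.
shared = true_skill + noise_sd * rng.standard_normal((n_candidates, 1))
corr = np.repeat(shared, n_signals, axis=1)
print(corr.mean(axis=1).var())   # ~ noise_sd**2 = 1.0
```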
Jon: 00:24:31
Fizz buzz.
Erik: 00:24:32
Exactly. The other reason why I think it's good is that, from what you're describing, it sounds like you're filtering for a more goal-oriented type of person. Something I've realized is that it depends on the role for sure. Different types of roles need different types of things, but depending on the role, you might need someone who's more tool-oriented, who is very good at their craft, who knows Kubernetes or Docker or whatever super well.
00:25:00
But for a lot of roles, you actually don't want that. You want someone who fundamentally comes in every day and does whatever has to get done in order for the business to get its results. Talking more about the business context is, I think, very important specifically for data scientists. Not always, because there are some data scientists who are all about just squeezing a couple more percentage points of accuracy out of, whatever, a speech recognition model for Alexa, and those people might be super valuable. But for 99% of data scientists, I think you want to focus on people who are more business-focused and care about the end result. I call it almost like data journalism. It's almost like a pursuit of the truth. They think of their job as wanting to uncover the truth buried in the data. That's what fundamentally gets them going.
Jon: 00:25:52
It's interesting, you've talked a number of times in this interview about what you look for in interviews, your explanation about signal versus noise. I think there's one other thing that makes interviewing so hard. A couple of episodes ago, in episode 616, I did an episode on the four requirements for expertise, beyond just the 10,000 hours thing that Malcolm Gladwell and others like to talk about.
00:26:17
And so, one of the things that you have to have is you have to have what they call a valid environment, which is this idea of there being meaningful signal relative to noise. The smaller the signal is relative to noise, the harder it is for there to be anything to learn about, to become expert in. But then, the other thing that makes hiring so hard to become expert at is the temporal lag between making the decision and finding out whether it worked out or not.
00:26:47
The kinds of things that you can become expert at tend to be things where you learn immediately. "Oh, I made the wrong chess move." You can see right away. "Oh, I answered this math problem wrong." You find out right away. Whereas with hiring, the argument that I made in that episode was literally about hiring. I said, "You can't really become expert in hiring because the time lag is so long. It's months, years, before you know." And so, even somebody like you who has made thousands of hiring decisions, because of that time lag, it becomes hard to be a true expert. You could be Magnus Carlsen at chess; it's tough to develop that same kind of expertise in hiring.
Erik: 00:27:23
Totally. I think those are two very different types of skills on that spectrum between fast feedback loops and very slow feedback loops. VC, investing in a company, how do you-
Jon: 00:27:34
Exactly.
Erik: 00:27:35
It takes 10 years to know. I think part of it is you just have to develop a strong intuition for proxy metrics. If you're a VC or if you hire a person, it might take a year to fully understand their productivity, or in the case of a VC, it might take 10 years to see them go public, but you might be able to see early signs earlier on, and that can give you a tighter feedback loop. But I think this is a general problem. Also, going from being an individual contributor to a manager is something I thought a lot about and personally struggled a lot with. It's actually not that the skills themselves are necessarily different. I think the biggest thing was just shifting from a short feedback loop to a long feedback loop cycle.
00:28:18
You're so programmed. Growing up, I grew up programming. You're so trained to have this sense of immediate gratification. You come home every day and you're like, "I wrote so much code. I feel good. I saw this code and it's amazing." Then you start managing, you just come home every day and you're like, "I don't know. I just don't feel like I did anything."
Jon: 00:28:39
Right. Totally.
Erik: 00:28:40
But a few years in, you start to realize, actually, I hired this person, or I managed this person on my team, I had a tough conversation, and now six months later, I'm noticing she's changed her behavior. You start to recognize those things and those things become rewarding. I think doing that, it's the marshmallow test. You have to learn how to win at the marshmallow test.
Jon: 00:29:05
This is a complete aside, but just this week, I read that the marshmallow test, if you control for socioeconomic factors-
Erik: 00:29:13
I've heard that too. I feel like all these psychological findings almost turn into, what's the word, allegories. They're almost tales more than they are actual results. I mean, it's the same thing with Dunning-Kruger. People talk about the Dunning-Kruger effect, and it turns out the original study actually had many flaws and wasn't really true. There's all these other things too, but I don't know.
Jon: 00:29:34
What's Dunning-Kruger?
Erik: 00:29:34
Dunning-Kruger is this idea that there's an inverse correlation between how much you know about something and how much you think you know. People think of it as a U-curve. When you don't know anything, you think you're really good at it. That's actually not what the study showed, but roughly speaking, the way people interpret it is: when you don't know anything, you think you're good. Then as you learn things, you realize how much you don't know, and then when you finally become an expert, you realize you actually do know a lot about it again. Again, that's not what the study really found, but that's what people think it is, and as such, maybe it's fine to refer to it that way. I don't know.
Jon: 00:30:10
Anyway, I've drifted you off course.
Erik: 00:30:12
That's fine.
Jon: 00:30:15
I don't know if you had a thought to finish or if you want me to move on to the next one.
Erik: 00:30:18
Maybe. Yeah, I don't know.
Jon: 00:30:20
Trying to create studio-quality podcast episodes remotely used to be a big challenge for us, with lots of separate applications involved. When I took over as host of SuperDataScience, I immediately switched us to recording with Zencastr. Zencastr not only dramatically simplified the recording process, we now just use one simple web app, it also dramatically increased the quality of our recordings. Zencastr records lossless audio and up to 4K video, and then asynchronously uploads these flawless media files to the cloud. This means that internet hiccups have zero impact on the finished product that you enjoy. To have recordings as high quality as SuperDataScience yourself, go to zencastr.com/pricing and use the code SDS to get 30% off your first three months of Zencastr professional. It's time for you to share your story.
00:31:11
Yes, now, you've given us some insight into the things that you look for when you're interviewing and your process around interviewing: this idea of having multiple tests, distilling multiple signals that you can aggregate together into some hopefully meaningful overall signal that predicts long-term success in the role. But now, let's flip it to the other side. You rightfully pointed out that I was really asking two questions. There was part A, which is about your experience as an interviewer, but then part B is, with that experience as an interviewer, what do you recommend to interviewees, to candidates? This idea of low-hanging fruit, things that they could be doing before an interview to succeed.
Erik: 00:31:51
I think, to me, I mentioned this spectrum of tool-oriented versus goal-oriented. Again, it depends what type of role you're hiring for, but I think most data teams should hire for the more goal-oriented or outcome-oriented type of candidate. For that one, what's important to show, or at least what I like to see as an interviewer, is people being, A, autonomous, having this drive to figure things out on their own. That's not to say they need to know everything about everything, but it does help when someone comes in and shows that: I built this data thing. First, I had to go out and scrape the data, then I fit this model, and then I built a Flask app. It doesn't have to be great, but they figured it out, they built something end-to-end. I like to see that. That's just one example. There are many of these examples. I wrote a blog post or whatever.
00:32:47
Something that shows that they wanted to solve something end-to-end, as opposed to someone who comes in and just shows that they can fit this model and this model and this model. That makes me question: okay, you like to fit models, but do you care about the business outcomes? I think that's quite important. The autonomy and the outcomes, to me those are the things. A lot of it is best shown through a track record of things you built. I'm talking more about junior candidates now. For senior candidates, maybe it's a little bit different; then it's more around, especially on data teams, things like: how do you work with stakeholders? How do you organize the team? How do you structure it? Who reports to whom? What does your platform look like? That's a very different type of role to hire for.
Jon: 00:33:40
That makes perfect sense to me. With respect to the junior one that you were describing, that end-to-end project, somebody writing the blog post about something they're interested in, I love particularly when they have a narrative behind it, "Hey, I'm really into basketball."
Erik: 00:33:54
Totally.
Jon: 00:33:54
And so, I found this basketball data set and I was able to do this new analysis and I was able to draw these conclusions and that actually has now changed the way that I play basketball or view basketball games.
Erik: 00:34:05
Totally. I personally don't know anything about basketball or horseback riding or whatever, but if someone writes an amazing blog post about how they use data for horseback riding, and it makes me interested in it, that's cool.
Jon: 00:34:19
Speaking of hiring and blog posts, one of my favorite blog posts of yours is about this toy model that you created for finding the ideal hires. Do you want to talk us through that blog post?
Erik: 00:34:32
I think the intuition of that blog post is that everything turns into a trade-off. For hiring, I think it's less intuitive; people understand it a little bit less than something like real estate, where, if you're looking for a home, people intuitively understand you're going to have to make trade-offs. You have a certain budget, and that means you can't get huge terraces and 5,000 square feet in the best prime real estate in New York, whatever. You're going to have to give up certain of those things because you have a budget that acts as a constraint. That constraint means some people prize having an outdoor space. Some people prize square footage. Some people prize, whatever, a double garage. I don't know.
00:35:24
I think the similar thing happens in hiring. People indirectly have a budget that they go out in the market. It doesn't necessarily have to correspond to a certain type of a dollar amount. It also I think could be interpreted as I have my set of candidate levels. This is the appeal of my company. This is the tier of engineers I'm able to hire. For most companies, unless you're the absolute most stellar company in the world, you're going to have to make trade-offs. You're not going to find candidates who are good at A and B at the same time. You're indirectly going to have to make a trade-off between you either get A or you get B.
Jon: 00:36:03
Now, wait a second. With you as Erik Bernhardsson running this cool new startup Modal, VC-backed, building engineering tools for data scientists, do you still have to choose A or B? I bet sometimes people who are both A and B do come to you and you can hire them.
Erik: 00:36:18
I like to think so, but I also think that's what you always think as a hiring manager: everyone else makes trade-offs, but I'm actually good at identifying talent. But anyway, going back to why these trade-offs are important, what I wrote about in this blog post is that if you prioritize A, but actually B is the thing that's good for your business, you are making it far worse for yourself. One example that I think applies to a lot of people: A is credentials, B is actual job performance ability. If you're fixated on credentials, of course, you're implicitly trading off on actual job performance. Even if there's a correlation between A and B, which there often is, because the hiring market is competitive, by over-focusing on credentials, you're going to trade off and get worse candidates, or vice versa.
00:37:16
If you're focusing on people who are extremely good at tool A, they're probably going to be, on average, worse at tool B. Part of it is just saying you need to be very careful about which things you focus on the most. The other thing is, frankly, I think it presents a bit of an opportunity. A lot of companies out there overhire for credentials or overhire for having certain cool companies in their past. There is an arbitrage opportunity here. If you're smart, if you want to play Moneyball here, you can go and try to find the people who maybe don't have the most polished resume but may be stellar candidates because they stand out in other ways. The silver lining is that I think it means there's a certain opportunity, and it's actually good for everyone that that opportunity exists. I encourage everyone to really think about that, because there are a lot of people who may not have the most polished resumes but may actually be phenomenal, and if you look a little closer, you can find those people.
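A rough way to simulate the core mechanism of Erik's toy model (a loose reconstruction of ours, not his exact math): if the market prices candidates on a mix of credentials and ability, then among the candidates you can actually afford, the two traits become negatively correlated, so over-weighting credentials mechanically trades away ability.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
credentials = rng.standard_normal(n)
ability = rng.standard_normal(n)       # independent in the population

# Assumption: the market's "price" for a candidate weights both traits.
price = 0.5 * credentials + 0.5 * ability

# Within a fixed budget band, the traits become negatively correlated:
band = np.abs(price - 1.0) < 0.05
print(np.corrcoef(credentials[band], ability[band])[0, 1])  # strongly negative

# So, conditional on what you can afford, chasing credentials costs you
# ability (an instance of Berkson's paradox / selection on a sum).
```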
Jon: 00:38:22
It's a great point. I mean, it was crystal clear the way that you conveyed it here in the audio-only format, but we'll make sure that we have a link to this blog post in the show notes; you created lots of cool visuals that make the concept visually appealing as well.
Erik: 00:38:40
There's a bunch of math and a lot of graphical intuition for what I just tried to describe in words.
Jon: 00:38:45
Nice, but you nailed it perfectly. We've talked about what you're doing at Modal. We've talked about all the hiring that you did at Better. Prior to Better, you were famous, you are famous for building the music recommender system from scratch at Spotify. I think it goes without saying that Spotify is the most used music streaming service in the world and you built the original music recommender system. It's used by hundreds of millions of people every day in the world.
Erik: 00:39:18
Yeah, probably.
Jon: 00:39:20
That's wild.
Erik: 00:39:22
To be fair, I left seven years ago, so my joke is, when the recommendations are really good, then it's like, "Yeah, of course, I built it." When they're not good, I'm like, "Oh, I don't know. I left seven years ago. I don't know what you did." To be clear, I'm sure there's a lot of stuff that has happened in the seven years since I left, but I've heard from people still at Spotify that, foundationally, it's still my ideas and, to some extent, even still my code running it.
00:39:49
I mean, it was a lot of fun. When I started at Spotify, I was very lucky in a way. I grew up in Sweden, and I knew a bunch of people from school who went to this company, this obscure music streaming startup with this crazy idea to put all the music in the cloud, called Spotify, and I ended up joining. And I think 10 years later or whatever, actually, no, it's crazy, it's 12 years later. You're right. I mean, it's a massive success.
00:40:12
Having been there from scratch I think was one of the reasons I turned into a degenerate startup person, because being part of that journey and that growth was amazing, but it was a fun problem. I did a lot of other stuff at Spotify too. I did a lot of data engineering and data science and business intelligence and that stuff. But the music recommendation system was... Funny thing is actually a large part of it was skunkworks. I actually built large parts of it on weekends and evenings on my own for large parts of the time. Then eventually, I was able to convince people Spotify needs some music recommendation system that's actually productionized.
Jon: 00:40:49
Oh, wow. Up until you built that, people could search for their favorite artist, they could find this album that they want to listen to and they could listen to that, but you couldn't have a song that you liked and say, "Just start a radio on this."
Erik: 00:41:04
No, no. There was no recommendation until... There were some very basic recommendations, but they weren't really smart and based on machine learning until 2011, I think. Actually, not quite, but roughly speaking, yeah.
Jon: 00:41:17
Are there aspects of this that you can dig into on air?
Erik: 00:41:21
Totally. Absolutely. What Spotify ended up doing, which I think is the best approach when you have tremendous amounts of data, which Spotify has, is essentially what's called collaborative filtering. The idea is you have all this data about what people listen to and also what playlists people create. That data obviously says a lot. The intuition is, if you see a lot of people listening to both tracks A and B, if those correlate a lot, then those tracks are probably pretty similar, and the same thing on an artist and album level.
00:42:00
Now, computing all these pairwise correlations turned out to be very inefficient because it's O(n²), and there's like 30 million tracks. The question was, okay, we probably need something smarter. At that time, there was a competition, I don't know if people remember it today, but it used to be a big thing back then, called the Netflix Prize. Netflix had this big competition where they open-sourced a bunch of data about movie ratings and then offered a million dollars to the first team that would beat their benchmark by 10%.
Jon: 00:42:31
It was solving the same kind of problem, wasn't it?
Erik: 00:42:32
Very similar.
Jon: 00:42:33
... film recommendations.
Erik: 00:42:34
Very similar, with the exception that in the Netflix case, people gave one-to-five ratings. In the Spotify case, I didn't have that ratings information. I just knew when people listened to a track. I could aggregate that and look at how many times people listened to it and get an idea of how much they liked it. It's the same idea. Turns out, actually, ratings... and Netflix has stated this too, they moved away from ratings. Ratings don't really matter as much as the implicit signal of what people actually pick.
Jon: 00:43:04
For sure.
Erik: 00:43:05
But anyway, I ended up building a lot of models in that vein, and all those models were unsupervised, as opposed to the Netflix case, which is supervised because you have labeled X and Y, but it's the same idea. In particular, the idea that I pursued and that worked really well was matrix factorization. And so, roughly speaking, distilling it down to the intuition here, the idea is you create this enormous, very sparse matrix where every row is a user and every column is a track or an album or an artist. And so, you end up with 10 million rows or something like that and 10 million items. Again, this matrix is extremely sparse, it's mostly zeros. The entries in this matrix are: how many times did this user listen to this track?
00:43:58
Now, there's a bunch of techniques for this, including traditional ones like PCA and SVD. I ended up using a bunch of different ones. In particular, I used a lot of NLP-inspired models; Word2Vec was a model we used a lot. There are also a few other, more old-school ones. Netflix had a bunch of papers, including one by, I forget exactly who, Koren and Volinsky, that people used a lot back then. But they all boil down to the same idea, which is that when you factorize this matrix, you find a low-dimensionality representation of every user and every item.
00:44:36
These representations are just small vectors. They're vectors of 40 real values. You have this vector space now where every user turns into a point in a 40-dimensional space and every track turns into a point in that same 40-dimensional space. The dot product between the two can be transformed into a prediction of how much that user likes that track. In that space, as it turns out, tracks that are very similar tend to lie close together, and the same with users too.
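As a rough illustration of implicit-feedback matrix factorization, here is a tiny gradient-descent sketch in plain NumPy. It is a minimal toy, not Spotify's production approach; the hyperparameters and the log transform of play counts are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_tracks, dim = 1_000, 500, 40   # 40 dimensions, as in the conversation

# Sparse implicit feedback: (user, track, play_count) triples.
plays = [(rng.integers(n_users), rng.integers(n_tracks), rng.integers(1, 50))
         for _ in range(20_000)]

U = 0.1 * rng.standard_normal((n_users, dim))   # one vector per user
V = 0.1 * rng.standard_normal((n_tracks, dim))  # one vector per track

lr, reg = 0.02, 0.1
for _ in range(10):                        # a few SGD passes
    for u, t, count in plays:
        target = np.log1p(count)           # squash raw play counts (assumption)
        u_vec, v_vec = U[u].copy(), V[t].copy()
        err = target - u_vec @ v_vec       # dot product = predicted affinity
        U[u] += lr * (err * v_vec - reg * u_vec)
        V[t] += lr * (err * u_vec - reg * v_vec)

# Recommend for user 0: highest-scoring tracks (filter out already-played ones).
scores = U[0] @ V.T
print(np.argsort(-scores)[:10])
```

Nearby track vectors in this space correspond to similar tracks, which is exactly the nearest-neighbour problem Erik turns to next.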
00:45:16
In that lower-dimensionality space, 40 dimensions, whatever, 100 dimensions, you have this nice geometric property. The axes don't mean anything. The X axis doesn't correspond to anything, nor the Y axis or whatever. It's 40 dimensions, but proximity means something. If there are two points that are close to each other, it means those tracks are very similar, and if you plot different genres, you end up seeing them clustering really well. The next question is, for every track, how do we find similar tracks? That turns out to be a messy problem because you're in a 40-dimensional space. You really want to avoid, again, this O(n²) thing where, for every track, you have to look at every other track and compute the distance.
00:46:05
I ended up building this vector database called Annoy. A-N-N stands for approximate nearest neighbors. It basically helps you do those queries very fast, because it turns out you can do all these tricks in this 40-dimensional space to cut down the search space very aggressively. And so, you can take the user and, in that space, look for track vectors that are close to that user, remove any track that the user's already listened to, because we have that data, and those tracks then turn out to be great recommendations for that user. Or you could take a track and look for similar vectors to that track, and those turn out to be good similar-track recommendations, or a good input for a radio station or something like that.
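Annoy's Python API is small enough to show in full. A minimal usage sketch (the random vectors here stand in for track and user embeddings from a factorization like the one above):

```python
import numpy as np
from annoy import AnnoyIndex  # pip install annoy

dim = 40
track_vectors = np.random.standard_normal((10_000, dim))

index = AnnoyIndex(dim, 'angular')   # 'angular' is a cosine-style distance
for i, vec in enumerate(track_vectors):
    index.add_item(i, vec)
index.build(100)                     # build 100 random-projection trees

# "Similar tracks": the 10 nearest neighbours of track 0.
print(index.get_nns_by_item(0, 10))

# Recommendations: nearest tracks to an arbitrary (e.g. user) vector.
user_vec = np.random.standard_normal(dim)
print(index.get_nns_by_vector(user_vec, 10))
```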
Jon: 00:46:54
Looking to take your career to the next level but not quite sure how? Well, check out Bunch, the AI leadership coach. Bunch is the easiest way to learn critical career skills like giving feedback, resolving conflicts and communicating with confidence, and you can do it in just two minutes a day. Bunch is not a one-size-fits-all course, but a fully personalized learning journey. You learn daily from a global community of coaches, managers, and executives from companies like Calm, HubSpot and Twitter. Download Bunch for free in the Apple App Store. Search Bunch Leadership, that's B-U-N-C-H leadership, or simply check out the link in the show notes.
00:47:33
Nice.
Erik: 00:47:33
That was fun. That's where I spent four or five years of my life.
Jon: 00:47:37
That was really cool to dig into that in detail. I didn't know any of that detail. My company uses a similar approach. We've tried lots of different ways of matching candidates to jobs based on the natural language of candidate profiles and the natural language of job descriptions. I've been working on this problem for seven, eight years now, and we've tried so many different ways of tackling it. The way that we ended up doing it is very similar to what you described.
Erik: 00:48:11
Some sort of embeddings and some sort of vector model.
Jon: 00:48:13
Exactly. Using embeddings of these documents to turn them into a vector representation. Then they're in the same space, the same high-dimensional space. In your case, you can have users and songs on the same map. In our case, we can have candidates and jobs. We can say, "Okay, we've got this job description. It locates to this particular point." We use a 196-dimensional space. Then, what candidates are nearby in that space? And you can use very simple mathematics, like you're saying, like a dot product.
Erik: 00:48:46
Totally. It's so conceptually beautiful once you get there. I mean, if there's any insight I've made in my life that I'm proud of, it's that back in 2008, I realized that vector models have all these cool properties that make them very nice as a building block for recommendation systems. Back then, there was the Netflix Prize, but I think it was still sort of obscure and all of that. I'm really happy that these embeddings are quite commonplace today.
Jon: 00:49:15
Oh, they're everywhere.
Erik: 00:49:16
OpenAI has an embedding API. There are a number of vector databases. I'm advising this vector database called Weaviate. There's a few other ones too. I really think that maybe if I hadn't done Modal, this is what I'd be doing today, beyond advising Weaviate: I think there's a huge opportunity to still build search engines and rethink them from the ground up. We have this concept of a vector today. There are so many cool opportunities for doing stuff in that space.
Jon: 00:49:46
For sure. It's a really rich way of representing information for a machine because it collapses everything down. Like you were describing, initially you were starting with this huge sparse matrix with 10 million rows and 10 million columns, and you're able to collapse that down. Sparse, for listeners who don't know, means lots of zeros, which makes a lot of sense in this scenario that Erik's describing: if all the rows are listeners and all the columns are tracks, of course any given listener has listened to almost none of the tracks, so it's almost entirely zeros. That's what highly, highly sparse means. But with these vector representations we've been talking about, you have a really dense representation, because then for any given track or any given Spotify user, you have this vector of 40 float values, and some of them might be near zero, but none of them are zero. They describe this rich location across these 40 dimensions.
00:50:39
And so, it's a really, really powerful thing. If you aren't already using vectors for something, you might want to learn about them and consider using them. It is how, like on my data science team, we're thinking about almost all of our problems in vectors all the time.
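To see the sparse-to-dense contrast in code form, here is a small illustration (ours, with made-up numbers), with SciPy's sparse matrices on one side and dense embedding vectors on the other:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Sparse side: a users x tracks play-count matrix stores only the nonzeros.
rows = [0, 0, 1]            # user indices
cols = [3, 9, 3]            # track indices
counts = [12, 1, 5]         # play counts
plays = csr_matrix((counts, (rows, cols)), shape=(1_000_000, 1_000_000))
print(plays.nnz)            # 3 stored entries out of a trillion cells

# Dense side: after factorization, each user and track is a 40-float vector
# where essentially every component carries information.
user_vec = np.random.standard_normal(40)
track_vec = np.random.standard_normal(40)
print(user_vec @ track_vec)  # affinity score via a simple dot product
```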
Erik: 00:50:54
Good.
Jon: 00:50:55
And so, you've told us now about Annoy, this open source library that you created for searching efficiently across these vectors. It's a high-dimensional nearest neighbor search. Do you want to dig into that, the technical aspects of that a little bit more?
Erik: 00:51:14
Yeah, there are a couple of talks I've done, so there are some presentations online if you want to go super deep. But the rough idea is: okay, you have this 40-dimensional space, how do you search that space in less than linear time? You don't want to have to go through every single point in the space and compute the distance. And so, the way Annoy works (to be fair, the state of the art today uses other types of methods) is that it partitions the space into a tree.
00:51:48
And so, you basically pick a roughly random hyperplane (although you can use the data in some way to inform it), you split the space, and you put half the points on one side and the other half on the other side. Then you do that recursively with each subspace: you take a subspace, split it again, then take each of those four and split them again, until you only have, I don't know, 10 points left in each region of this 40-dimensional space. It turns out the points that end up in each leaf node of this tree are very close to each other.
00:52:27
On the other hand, there's a problem now, because sometimes, by accident, you pick a hyperplane that partitions two points that are very close to each other. The trick then is to do this whole thing 10, 40, maybe a few hundred times. And during the search phase, when you're searching through the space, sometimes you actually go down both sides of the plane.
00:52:51
If you have a query point, you go down the tree, always staying on the same side, and you end up in a leaf node. But sometimes you actually want to look at the other side too, if you're close enough to the plane. The other trick, as I mentioned, is that you run this hyperplane partitioning scheme 10, 20, 100 times, starting over with all the million points and repartitioning the space each time. And so, you end up with, say, 100 trees, and you can actually search all those trees in parallel and find nearby points. That's the rough intuition. More modern methods, I should point out, use a very different approach, which is probably better: a graph-based approach, where they try to build a graph of points that are close to each other. I know less about that, but there's a bunch of state-of-the-art algorithms from Facebook and a few others.
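For anyone who wants to try this, the Annoy library itself exposes a small Python API. A minimal sketch, with random vectors standing in for real track embeddings:

```python
import random
from annoy import AnnoyIndex  # pip install annoy

f = 40  # dimensionality, matching the 40-dimensional vectors above
index = AnnoyIndex(f, "angular")  # angular distance ~ cosine similarity

# Add 100k random points as stand-ins for track vectors.
random.seed(0)
for i in range(100_000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])

# Build the forest of trees: more trees give better recall at the cost
# of a bigger index.
index.build(50)  # 50 trees

# Query: the 10 approximate nearest neighbours of item 0.
print(index.get_nns_by_item(0, 10))
```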
Jon: 00:53:44
Cool. That was a crystal clear explanation, and there was some great hand-waving that Erik did here. If you're listening to the audio version and thinking, "Maybe I could grasp this a little better if I could see the hand motions," the YouTube version of the podcast will have those for you.
Erik: 00:54:01
I wish I could overlay it, like split the air and color it red and blue.
Jon: 00:54:10
There's an amazing YouTube channel by someone named Alfredo Canziani. Have you heard of him?
Erik: 00:54:14
No.
Jon: 00:54:15
Alfredo is a lecturer on some of Yann LeCun's courses at NYU. During the pandemic, they moved all their courses online, and NYU published these classes that Yann LeCun and Alfredo offer. Then what Alfredo has painstakingly done is exactly what you described: he's added graphics that enable visualizations of equations, concepts and graphs.
Erik: 00:54:51
Yeah, that's amazing.
Jon: 00:54:51
He created a really cool YouTube channel.
Erik: 00:54:54
I'll have to check it out.
Jon: 00:54:55
Yeah, it's great. I'll try to remember to include a link to Alfredo in the show notes, and if anybody knows him, try to get him on the show. About a year ago, he and I corresponded back and forth a bunch of times, and he was interested in the show but too busy making these very elaborate YouTube videos. I'd love to have him on. So yes, that's Annoy. Another really popular open source tool that you built at Spotify is Luigi, which is a great name. That one's really obvious to me, because it's for pipes.
Erik: 00:55:28
It's green like Spotify. That's the other connection.
Jon: 00:55:34
That was going to be my next question is why not Mario, and now we know.
Erik: 00:55:36
Now you know. I mean, with Luigi, I think before Airflow was big, Luigi had a lot of users. It's a similar story in that I wanted a tool: as I was building the music recommendation system at Spotify, I ended up with these very complex pipelines that I sometimes had to run for weeks, and things would crash. I needed a way to resume from a partial state where certain things were created and certain things were not. I had to deal with all these questions: you write some data to disk, but is it really fully complete? The atomicity of producing artifacts, mechanisms like that.
00:56:16
So I ended up building Luigi. It was actually inspired by makefiles, that old compiler tool that's really annoying to use but has the elegant idea of framing everything as a graph problem, a dependency graph. People talk about DAGs, directed acyclic graphs, which in reality just means dependency graphs: this thing depends on this, which depends on this. Luigi lets you model that pretty nicely. And so, I think for a while it made a lot of sense to use Luigi for a lot of those things. A lot of companies had similar problems, as it turned out.
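A minimal sketch of what a Luigi pipeline looks like; the two toy tasks, their file names and their contents are made up for illustration:

```python
import luigi

class ExtractPlays(luigi.Task):
    """Upstream step: pretend to dump raw play counts to disk."""

    def output(self):
        # Luigi checks whether this target exists to decide if the task
        # is complete, which is what lets a crashed pipeline resume.
        return luigi.LocalTarget("plays.tsv")

    def run(self):
        # Writing through the target is atomic: the file only appears
        # once the block finishes, so no half-written artifacts remain.
        with self.output().open("w") as f:
            f.write("user1\ttrackA\t3\n")

class TrainModel(luigi.Task):
    """Downstream step: depends on ExtractPlays in the dependency graph."""

    def requires(self):
        return ExtractPlays()

    def output(self):
        return luigi.LocalTarget("model.txt")

    def run(self):
        with self.input().open() as plays, self.output().open("w") as f:
            f.write(f"trained on {sum(1 for _ in plays)} rows\n")

if __name__ == "__main__":
    luigi.build([TrainModel()], local_scheduler=True)
```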
00:56:53
I open sourced it just thinking, well, maybe someone's going to use it. Then a couple of years later, a lot of companies did. Over time a bunch of other stuff came out; Airflow came out. I think Airflow does certain things worse, but overall it has a web interface that people really like, among other things. I think that was clearly the thing that resonated with people.
00:57:15
Today, there are even more competitors: there's Prefect, there's Flyte, there's Dagster, and probably a bunch that I always forget. It's an interesting space, and I think a lot about it. But with Luigi, I'm glad someone called it: you either die a hero or you live long enough to become the villain. His argument was that Luigi died a hero, because it faded out and no one really uses it anymore. There are maybe a couple of companies still using it, but not many today. I certainly haven't maintained it for many years, and it's sort of been left to atrophy a little bit. But the ideas, at least, I hope will have survived-
Jon: 00:57:56
... and really taken off in these commercially backed applications like Airflow. You just mentioned that you let Luigi rot. You've created a lot of open source repos over the years, including a very meta one, a GitHub repository analysis project.
Erik: 00:58:18
I got interested in how code grows. Originally, I started writing a blog post; actually, the blog post drove the tool itself. I wrote a blog post about how programming projects evolve, and I had this thesis that well-structured programs, well-structured repos, have lower code churn; they have a solid foundation. Look at the Linux repo: Linux has an extremely solid core in the kernel, and then all these drivers are super modular, all these network drivers and hardware drivers. If you look at the Linux kernel, the average half-life of the code is extremely high. I got interested in that: what is the half-life of code? I ended up calling the tool Git of Theseus, after the Ship of Theseus, the old apocryphal story of a ship: you replace every part of the ship, is it still the same ship? I don't know. So that was the Git of Theseus.
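The half-life framing boils down to a little exponential-decay arithmetic. A small sketch; the 80% survival figure is a made-up example, not a measured Linux number:

```python
import math

def code_half_life(surviving_fraction: float, years: float) -> float:
    """Half-life implied by exponential decay of lines of code.

    If `surviving_fraction` of the lines written `years` ago are still
    present, assume survival ~ exp(-rate * t) and solve for the time at
    which half the lines would have been replaced.
    """
    decay_rate = -math.log(surviving_fraction) / years
    return math.log(2) / decay_rate

# Example: if 80% of five-year-old lines are still in the repo, the
# implied half-life is roughly 15.5 years, a very "Linux-like" number.
print(round(code_half_life(0.80, 5), 1))
```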
Jon: 00:59:21
It often gets used as an analogy when people talk about aging.
Erik: 00:59:26
Every cell in your body's new, are you still the same person?
Jon: 00:59:30
Exactly. I don't know, I'm really into this idea; probably a lot of people are. I think we're at a point in history where, certainly more than at any other point so far, people are living longer, people are healthier longer, and there are innovations coming out, applied to mice and rats, that would appear to make them younger from a biochemical perspective. It brings up this Ship of Theseus analogy a lot, because there's this idea that maybe you could live for hundreds or thousands of years, but with none of the same parts you had at the beginning. That raises interesting questions about identity and self.
Erik: 01:00:17
Who are you? It's very deep. I'm convinced that we're the last generation to die, which is terrifying. I think people got to figure that out. My kids may not die. I don't know.
Jon: 01:00:30
Yeah, it's possible. Or it could end up being something like this, and I think I might get this analogy quite wrong, there's probably some aging expert out there who's going to say I'm getting something about this wrong, but I think some organisms, like tortoises, don't age in the sense that they're no more likely to die in year 200 than in year two. It's just that at some point, something happens: you get a virus, or a shark eats you, or whatever.
Erik: 01:00:57
Totally. There's a sort of hazard rate; they have a constant hazard rate. If you look at humans, their lifespan has this Gumbel, whatever, I don't know how to pronounce it, Gumbel distribution, where the hazard rate grows over time: your likelihood of dying goes up every year. Whereas if you look at a tortoise, like you said, they have, I believe, an exponential distribution, which has a constant hazard rate. The only reason I know a lot about this is that I actually spent a lot of time working on survival analysis for something very different: I looked at conversion of paying customers, that kind of stuff.
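The hazard-rate distinction is easy to see numerically. A small sketch using SciPy; note that Erik says Gumbel, while the classic increasing-hazard mortality model available in SciPy is the closely related Gompertz distribution, used here as a stand-in:

```python
import numpy as np
from scipy import stats

t = np.linspace(0.1, 5, 5)

def hazard(dist, t):
    # Hazard rate h(t) = f(t) / S(t): the instantaneous risk of "dying"
    # at time t, given survival up to t.
    return dist.pdf(t) / dist.sf(t)

tortoise = stats.expon()     # constant hazard: no worse off old than young
human = stats.gompertz(0.5)  # hazard grows with age

print(np.round(hazard(tortoise, t), 3))  # flat: 1.0 at every age
print(np.round(hazard(human, t), 3))     # increasing with t
```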
Jon: 01:01:30
How can we make our best paying customers live forever? We're going to take... Where are-
Erik: 01:01:34
Not quite that. I was more interested in what probability distribution tends to model how long a person coming in will take to convert. But it turns out all the literature about that kind of analysis, which is called survival analysis, comes from understanding mortality, particularly in actuarial science, like life insurance. Instead of looking at people converting by buying a product, it's about people dying. And so, it's a little morbid, but interesting.
Jon: 01:02:06
Right. Put actuaries out of business.
Erik: 01:02:09
We veered off. We were talking about-
Jon: 01:02:11
We have veered off. It's all related, man. Ship of Theseus is where we left off.
Erik: 01:02:18
Ship of Theseus.
Jon: 01:02:18
You were making some software-related point.
Erik: 01:02:22
And so, anyway, I ended up building an open source repo that analyzes this, and I still actually use it on my own code to look at growth over time. It's very interesting.
Jon: 01:02:33
Cool. So we've got that repo, and you've had repos for AI algorithms, models that generate fonts, models that play chess, and several others. We'll leave links to a bunch of them in the show notes, such as the deep fonts one.
Erik: 01:02:54
That's old, to be fair. That was a time in my life when I had more time to play around with things.
Jon: 01:03:01
That's where I'm going with this question, Erik. We've noticed that you haven't had as much time lately; you're creating a startup, and we're not seeing as much activity on GitHub from you as we used to. What would you be doing today if you had the time? Say you exit Modal tomorrow, and on Monday you can start on some new open source project. What do you think you would do?
Erik: 01:03:32
I don't know. I mean Modal is in a way the thing I always wanted to build. I'd certainly love it.
Jon: 01:03:39
So you'd just exit and then stay working there, creating open source versions of things.
Erik: 01:03:45
I don't know. I mean, one area that I've been interested in but never have time to investigate is probabilistic programming, quantifying uncertainty. I find those methods have a lot of promise, but they're very inaccessible, in my opinion. I've been in so many business meetings where someone throws up a chart, and then, when you ask questions, you realize it's based on way too few users. You can't really make reliable conclusions from it, because it's very noisy. But quantifying uncertainty is actually a very hard problem. And so, that's an area I've tried to learn slowly over time, learning a lot about Bayesian statistics.
Jon: 01:04:32
I was just going to say, are you a Bayesian guy?
Erik: 01:04:33
Yeah, Bayesian statistics, MCMC methods. It turns out these things are hard, but I wonder if there's a way to make them more accessible. Actually, speaking of charts, I also think visualization is a massive opportunity. I love spending time making charts beautiful; it's the vain side of me, but it's hard, and I wish there were better tools for it. That's an area I would love to spend time on.
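A minimal sketch of the kind of uncertainty quantification Erik describes, using the PyMC library that comes up below; the 7-conversions-out-of-25-visitors numbers are invented for illustration:

```python
import arviz as az
import pymc as pm

# The "chart based on way too few users" scenario: 7 conversions out of
# 25 visitors. The point estimate says 28%, but how sure should we be?
with pm.Model():
    rate = pm.Beta("rate", alpha=1, beta=1)            # flat prior
    pm.Binomial("obs", n=25, p=rate, observed=7)       # the small sample
    idata = pm.sample(2000, tune=1000, random_seed=0)  # MCMC (NUTS)

# The posterior makes the uncertainty explicit: the credible interval
# around the conversion rate is wide with this little data.
print(az.summary(idata, var_names=["rate"]))
```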
Jon: 01:05:03
Prior to being big into Python, I was big into R, and I found plotting there, with ggplot2, really fun to work with. There have been GitHub repos that were supposed to bring that same grammar-of-graphics, ggplot style to Python, but they never had all the same functionality, and to my knowledge those libraries aren't maintained very well today.
Erik: 01:05:31
It's actually something I've always wanted to spend time on. If I had more time, I would definitely sit down and learn ggplot, because I keep hearing the same thing: plotting in ggplot is much better than it's ever been in Python. I love and hate Matplotlib. In the end it usually lets me make beautiful charts, but it's very confusing and unintuitive, in my opinion. I would love to learn ggplot.
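For the curious, plotnine is one port of R's grammar of graphics to Python. A minimal sketch, with made-up data:

```python
import pandas as pd
from plotnine import aes, geom_line, geom_point, ggplot, labs

df = pd.DataFrame({
    "dims": [8, 16, 32, 40, 64, 128],
    "recall": [0.61, 0.72, 0.81, 0.84, 0.88, 0.90],
})

# Same compositional style as ggplot2: data + aesthetics + layers.
plot = (
    ggplot(df, aes(x="dims", y="recall"))
    + geom_point()
    + geom_line()
    + labs(x="Embedding dimensions", y="Recall@10")
)
plot.save("recall.png")
```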
Jon: 01:05:56
Let's put a pin in these tools that you love for one second so I can mention that, if listeners are interested in probabilistic programming, in episode 585 we had Thomas Wiecki, who leads the PyMC project; that's a fascinating episode if you want to hear more about it. But back to what you were just talking about, Erik: tools you'd love to learn more about. Are there tools that you're excited about today, that you're using or have just gotten into, that you think listeners should know about?
Erik: 01:06:32
I mean, in general, there's a thesis I have, and Modal is certainly part of it. By the way, Modal is obviously, selfishly, a tool I think people should learn about. But let's put Modal aside for a second; I'm not going to promote it too much. I think there's a broader trend of things moving towards serverless, of people not having to think as much about infrastructure, and I actually think most of the cool projects so far have been more front-end related or database related. That's an area where there's a lot of cool stuff.
01:07:13
On the front-end side, I dabble with front-end, but I'm a terrible front-end engineer. It feels like there's just so much crazy innovation going on there that if I had time, I would spend more of it really learning cool new things like Svelte or Next.js, or whatever it is, all this crazy stuff they're doing around building custom compilers or doing Wasm stuff. I think there's some really cool stuff there.
01:07:40
But going back to serverless and infrastructure, a tool I've been using a lot recently that I think is cool is Pulumi. It's essentially Terraform, but programmable. In my opinion, the only flaw with Pulumi is that it doesn't go far enough: I would love to put application code together with infrastructure, make it all code, and have the code describe its own infrastructure.
01:08:07
What I like about Pulumi is that it was the first framework where, playing around with it, I felt like I was programming the cloud, programming infrastructure, which is a really cool feeling. Other than that, I think there are a lot of databases that are really exciting: DuckDB, SQLite, and Neon, a new serverless database that Nikita from SingleStore is working on. There's a lot of innovation in that space that I wish I had more time to play around with.
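A minimal sketch of what "programming the cloud" looks like with Pulumi's Python SDK. The resource names are made up, and it assumes the pulumi and pulumi-aws packages are installed and AWS credentials are configured:

```python
import pulumi
import pulumi_aws as aws

# Declare an S3 bucket; on `pulumi up`, Pulumi diffs the desired state
# declared here against what actually exists in the cloud.
bucket = aws.s3.Bucket("demo-bucket")

# Because this is ordinary Python, loops and functions work as usual.
queues = [aws.sqs.Queue(f"worker-queue-{i}") for i in range(3)]

# Export outputs so other stacks (or humans) can read them afterwards.
pulumi.export("bucket_name", bucket.id)
pulumi.export("queue_urls", [q.url for q in queues])
```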
Jon: 01:08:42
That was an amazing list of different categories of tools, with lots of specific examples in each; a really impressive run-through. Something we started doing roughly six months ago on the podcast is picking five segments of the whole episode that we think are really great answers to a specific question, and putting each out as a two-minute or ten-minute YouTube video on that discrete topic. I know for sure that your run-through of all those super valuable tools across so many different spaces is going to be a standalone YouTube feature.
Erik: 01:09:26
Okay, sounds good.
Jon: 01:09:28
That was awesome. We have some audience questions. When we have really well-known guests like you come on the show, about a week before we record the episode I post on social media, on LinkedIn and on Twitter: "Hey, Erik Bernhardsson's going to be on the show. Do you have any questions for him?" The post about you had a huge amount of engagement: at the time of recording, the LinkedIn post announcing you'd be on the show had over 13,000 impressions.
01:10:03
We had a great question here from Mathias Baudino, a BI analyst at Brain Technology. He says, "What an amazing guest! As a question, I would ask Erik: what are the benefits of developing Luigi over adopting an existing tool and working from there? I know that if you make the framework yourself, you will have much more control over its features and can develop it to serve your needs down to the most complex details, but I would love to hear the response."
01:10:31
This ties into something we've already talked about earlier in the episode, but from a specific angle. I don't think he means to be talking about Luigi specifically; he's asking how, when you're thinking about creating an open source project, you decide. You mentioned doing it out of spite before. How do you know, spitefully, that the tool you need definitely doesn't already exist and you're going to have to make it?
Erik: 01:10:58
My rule of thumb here on what you should or shouldn't open source: I almost think you should open source by default. For most of the stuff you end up building at a company, you shouldn't think of open sourcing as something you opt in to, but as something you opt out of. You opt out if it's a core competence of the business, if it's not standalone, or if it's plausibly not useful for anyone else. But if the opposite holds on all three, meaning it's relatively standalone and self-contained, like it sits in a box over here and does one thing; it's potentially useful for other people; and it's not a core competence, so you're not putting business secrets out there, then I think it might be a good thing to open source.
Jon: 01:11:56
Awesome. That is a crystal clear answer. Erik, this episode has been phenomenal. I expected nothing less. Before I let you go, a question that I ask all of our guests on the program is if you have a book recommendation for us.
Erik: 01:12:15
I don't know. I used to read a lot of books; the last couple of years it's been a little less, with kids and startups and stuff. One random area I fell into and then just kept going with: I got really interested in companies that failed, and there's almost a whole genre of business books about it. There's a very good one called Hard Landing, about the struggles of making money in the airline industry. Then there are, of course, a number of company scandals. There's a good book about Enron; I forget the name. Is it The Smartest Guys in the Room, or something like that?
Jon: 01:12:54
That is what it's called I think.
Erik: 01:12:56
There's Bad Blood, about Theranos. There's a great book about Research In Motion called Losing the Signal; that one is very good. There's a pretty good book about G.E., General Electric, called Lights Out or something like that.
Jon: 01:13:12
That's right.
Erik: 01:13:12
There was another good book about Long-Term Capital Management, When Genius Failed. It's a very interesting book. I don't know, I got so fascinated by this because they're all examples of companies, maybe with the exception of Bad Blood, which was outright fraud, and Enron too, that did a lot of things well but still ended in massive destruction of shareholder value. I find that really fascinating: what ended up happening, what went wrong? It's been an interesting genre that I've been reading a lot of.
Jon: 01:13:51
It seems like the kind of thing where there might be some satisfaction in reading some of those books, especially the Theranos situation, people committing that kind of fraud. You probably get to enjoy some schadenfreude as they start to get caught.
Erik: 01:14:06
For sure.
Jon: 01:14:07
While at the same time, you get to learn business lessons for yourself.
Erik: 01:14:11
Maybe it's a cynical thing; I just like the schadenfreude. But even in Bad Blood, there's something I'm trying to understand: why did investors believe in this? It's wild.
Jon: 01:14:23
So wild.
Erik: 01:14:24
I think there's this suspension of disbelief that goes too far, and there are these organizations of complacency and yes-men, and it goes much further than that. I got really interested in dictatorships, how they sustain themselves and what keeps them going. There are so many interesting patterns throughout history, and throughout corporate history, of companies that create a delusion, then eat that delusion and just keep living off it, and then one day it all comes down. I think that's fascinating. I don't know.
Jon: 01:15:02
I agree. I find that fascinating right now. There are a few regimes in our world today that have made some big moves, and it seems to have made their rickety foundations ricketier. I'm not going to talk specifically about what those things are, because by the time this episode airs, it might be a very different situation, but there are some really interesting things happening.
Erik: 01:15:25
I'm also not going to get into specifics, but I think if you surround yourself with people who say yes to all your crazy ideas and praise you for coming up with them, you're going to come up with more and more stupid ideas.
Jon: 01:15:38
Exactly. We've seen some of that in recent years. Erik, this has been an awesome episode. I've loved digging deep into technical topics, startup entrepreneurship topics as well as our random asides into things like the Ship of Theseus and how we're going to live forever, for sure, definitely. And so, starting to wrap up the episode, no doubt there are many listeners out there who would love to hear more from you after the podcast. How should they follow you?
Erik: 01:16:10
You can follow me on Twitter. My handle is Bernhardsson like my last name. It's kind of hard to spell, but I'm sure you can share it in the footnotes.
Jon: 01:16:18
For sure.
Erik: 01:16:18
You should obviously check out modal.com. You should check out my blog erikbern.com, which needs a little bit more blog posting, which I'm working on right now finally after-
Jon: 01:16:29
Can you give a sneak peek as to what you're working on?
Erik: 01:16:33
Actually, I have multiple blog posts going; I just need to finish them, which is kind of ridiculous. One I'm working on is about where I think the cloud is going. It's mostly an extension of a lot of tweeting I've been doing recently, but my grand thesis is that we're still fairly early in cloud adoption, and we still haven't really thought through how we operate with the cloud as developers. I think there's some massive value creation that can happen if we start to realign our workflows with how the cloud works and what it enables. That's one thesis I have, and there's a bunch of other stuff I'm working on too.
Jon: 01:17:09
Exciting. We look forward to that. You are an epic Tweeter, an epic blogger. It's been an honor to have you on the show, Erik.
Erik: 01:17:18
Thank you.
Jon: 01:17:18
Maybe in a couple of years-
Erik: 01:17:19
It's been a lot of fun.
Jon: 01:17:19
... we can catch up again.
Erik: 01:17:20
Would love to. Thank you.
Jon: 01:17:22
Nice. Thanks, Erik.
Erik: 01:17:24
Thank you.
Jon: 01:17:30
Well, the legendary Erik Bernhardsson certainly did not disappoint. I've been following his work closely for years, so it was such an honor to meet him and chat data science with him today. I hope you gained as much from our conversation as I did.
01:17:43
In the episode, Erik filled us in on how he started his company Modal because he needed the tools they're building and how he aims to eventually replace the Kubernetes standard for model deployment. He talked about how he recommends focusing on hiring goal-oriented people as opposed to tool-oriented ones. How the best way to impress him in an interview is to explain an end-to-end data science project you undertook to solve a problem that personally interests you. He talked about how the Spotify music recommendation model relies upon collaborative filtering of implicit song preferences and collapsing sparse matrices into dense vector representations in order to be able to provide recommendations instantaneously.
01:18:21
He talked about how his Annoy library enables efficient searching through high-dimensional vector spaces by partitioning the space with random hyperplanes, and how he particularly recommends Pulumi, an open source infrastructure-as-code tool.
01:18:36
As always, you can get all the show notes, including a transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for Erik's social media profiles as well as my own, at superdatascience.com/619. That's superdatascience.com/619. If you'd like to ask questions of future guests of the show, like audience member Mathias did during today's episode, then consider following me on LinkedIn or Twitter, as that's where I post who upcoming guests are and ask you to provide your inquiries for them.
01:19:05
If you'd like to engage with me in person as opposed to just through social media, I'd love to meet you at the Open Data Science Conference West, ODSC West, which will be held in San Francisco from November 1st through 3rd. I'll be doing an official book signing for my book, Deep Learning Illustrated, and will be filming a SuperDataScience episode live on the big stage with the world leading deep learning and cryptography researcher Professor Dawn Song as my guest. In addition to the formal events, I'll also just be hanging around, grabbing beers and chatting with folks. It'd be so fun to see you there.
01:19:40
All right. Thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you. Thanks of course, to Ivana, Mario, Natalie, Serg, Sylvia, Zara and Kirill on the SuperDataScience team for producing another killer episode for us today. For enabling this super team to create this free podcast for you, we are deeply grateful to our sponsors. Please consider supporting the show by checking out our sponsors' links, which you could find in the show notes. If you yourself are interested in sponsoring an episode, you can find our contact details in the show notes as well or you can make your way to jonkrohn.com/podcast. Last but not least, thanks to you for listening all the way to the end of the show. Until next time, my friend, keep on rocking it out there and I'm looking forward to enjoying another round of the SuperDataScience Podcast with you very soon.