Jon Krohn: 00:00:00
This is episode number 665 with Josh Wills. Today’s episode is brought to you by Pathway, the reactive data processing framework and by epic LinkedIn Learning instructor Keith McCormick.
00:00:16
Welcome to the SuperDataScience Podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now let’s make the complex simple.
00:00:47
Welcome back to the SuperDataScience Podcast. Today we’ve got the extraordinary Josh Wills on the show. Josh has done a startling amount in his career. He worked to decarbonize transport as a software engineer at WeaveGrid, a Bay Area startup. He modeled Covid-19 full-time for the Government of California in early 2020 as the pandemic was first kicking off. He was the first Director of Data Engineering at Slack. He was Director of Data Science at Cloudera. He was Staff Software Engineer at Google. He’s co-authored several editions of O’Reilly books on advanced analytics. He has given countless thought-provoking and very funny talks at major data science conferences, and now describes himself as a “gainfully unemployed data person” as he contributes to open source software projects and develops his Data Engineering for Machine Learning course.
00:01:36
Today’s episode will appeal most to technical listeners that are keen to be outstanding data scientists or software engineers, particularly through engineering scalable machine learning projects. However, much of the content in this episode will appeal to anyone who’d like to hear from a brilliant, thoughtful, and seasoned professional who goes into depth on the orders-of-magnitude more efficient contextual bandit approach to testing models in production, how to avoid the “infinite loop of sadness” in data product development, the pros and cons of choosing a management-track career path relative to an independent contributor path, and what it’s like to be called on as a life-saving data-science superhero during a catastrophic global event. All right, you ready for this delicious episode? Let’s go.
00:02:24
Josh Wills, welcome to the SuperDataScience Podcast. It’s awesome to have you here. Where in the world are you calling in from?
Josh Wills: 00:02:31
Hey, Jon, how you doing? Thanks so much for having me. I am calling from San Francisco, California.
Jon Krohn: 00:02:37
Nice. And actually on the YouTube version, it looks like we can see that you have a sunny day there, perhaps?
Josh Wills: 00:02:42
We do actually have a sunny day here. But you do not always get, I live in the western side of San Francisco, and so it’s not always sunny on this part over here, but we’re having a little, little sort of window of beautiful weather before another atmospheric river comes in a few days and like, you know, slams us again. So I’m enjoying it while we can. Yeah.
Jon Krohn: 00:02:59
I haven’t heard of these atmospheric rivers, but I guess that’s like, it’s just why it’s always cold in San Francisco all the time.
Josh Wills: 00:03:05
Oh, no, it’s these really great fun like weather storm things they call them. It’s called the Pineapple Express and it’s like essentially like a river in the sky.
Jon Krohn: 00:03:14
I thought that was a strain of weed.
Josh Wills: 00:03:16
Yeah, I mean, it’s that too, which also, you know, San Francisco. But you can get a contact high just from walking around outside for an extended period of time. But no yeah, sorry. Just like I, I’m a nerd for weather stuff as well. I have like, I think, you know, you’ll find in this episode I have like a bunch of nerdy interests and stuff, and so like Daniel Swain and like Weather West, Weather Twitter, I love this stuff. We’re obsessive about the weather out here in California. Like even by America’s standards, we talk about the weather a lot.
Jon Krohn: 00:03:41
Well, but I interrupted you. What is a Pineapple Express other than being Pineapple Express?
Josh Wills: 00:03:44
So, yeah, so there’s typically a giant high pressure system off the coast of California that protects us from moisture, which is why it’s like always so sunny and nice here, but occasionally these low pressure systems manage to kind of pierce the high pressure system. And when they do, like literally just an absolute torrent of rain and snow and other kind of weather comes, again, typically originating from Hawaii, and just slams right into us. And this has been happening here for the last couple of months or so. It’s been, you know, after like years and years and years of drought out here, it’s been just extraordinarily rainy, extraordinary, like, it’s like off the charts, like quantities of snow up in the Sierras and stuff like that right now.
Jon Krohn: 00:04:22
Right, right. Yeah. Friends in the area have had, like, they’ve had to have emergency crews to like deal with all the trees that have fallen.
Josh Wills: 00:04:30
Exactly. Snow everywhere. And like, yeah. And again, and this is just not something we deal with well at here in California, like at all. We’re not equipped for this kind of stuff. And so anyway, yeah, this is, like I said, it’s a big deal for us.
Jon Krohn: 00:04:42
Well, so we know each other not through any kind of weather affiliation, but through-
Josh Wills: 00:04:48
No, that’s true.
Jon Krohn: 00:04:49
Erik Bernhardsson.
Josh Wills: 00:04:50
Yes, absolutely.
Jon Krohn: 00:04:50
Really brilliant data scientist, software engineer, entrepreneur.
Josh Wills: 00:04:57
Yes.
Jon Krohn: 00:04:57
And he was in episode number 619 of this show, which, I recently did the tallying, was the sixth most popular episode of 2022.
Josh Wills: 00:05:09
Not bad. Not bad.
Jon Krohn: 00:05:10
Which is yeah, unsurprising, given how interesting he is. And yeah, I asked him, I said at the end, after we’d finished recording, I was like, “Erik, you’re one of the most interesting people in data science and software engineering. Do you have anyone else that you could recommend for the show?” And he said, “Josh Wills”.
Josh Wills: 00:05:26
Oh man. That’s, that’s, I am, I am honored. That is, that is high praise coming from Erik. Yeah. Cause he’s like, Erik is like, who I want to be when I grow up. So that’s, that’s like deeply flattering. Yeah. Yeah.
Jon Krohn: 00:05:35
And then more recently we had a guest on the show, Chip Huyen. She was in episode number 661.
Josh Wills: 00:05:43
Yes.
Jon Krohn: 00:05:44
And she, she knew that you were coming up on the show in a couple of weeks. So she left a message for you.
Josh Wills: 00:05:51
I think I literally have an autographed copy of her book right next to me. That’s like kind of fantastic.
Jon Krohn: 00:05:55
Yeah. And we’re gonna talk about that later in the show. Okay. But so she, she, our very first question on the program is from Chip. She says, how did Josh become so interesting? He has viral quotes on Twitter all the time. How does he do it?
Josh Wills: 00:06:10
Yeah. How do, how do I, how did I become so interesting? That’s, that’s a tough one. Yeah.
Jon Krohn: 00:06:17
We’re gonna dig into,
Josh Wills: 00:06:18
Do I have to answer now? Or I can think about it for a bit? I don’t have to…
Jon Krohn: 00:06:21
You can think about it for a bit. And we’re also, the audience is gonna kind of get a sense of that because we’re gonna go through the, in kind of reverse chronological order, the extremely interesting things you’re doing today, and then go back. Yeah, it is, it is mind-blowing. Our researcher Serg had a field day trying to compress all of the interesting things you’ve done.
Josh Wills: 00:06:42
Okay. All right.
Jon Krohn: 00:06:43
Into the questions here.
Josh Wills: 00:06:44
Into [inaudible 00:06:45]. Sure. That makes sense. Okay.
Jon Krohn: 00:06:46
First interesting item on Josh Wills.
Josh Wills: 00:06:48
Yes.
Jon Krohn: 00:06:48
What you’re doing right now, you’re an instructor for a brand new comprehensive course called Data Engineering for Machine Learning.
Josh Wills: 00:06:55
Yes, that’s right.
Jon Krohn: 00:06:56
And so this course is all about architecting the data infrastructure required to support scalable machine learning in production environments. Josh, tell us a bit about the course.
Josh Wills: 00:07:06
Yeah, yeah, yeah, absolutely. So the course, I teach with a, I teach with an organization called Sphere that teaches any number of classes on sort of, all sorts of, of interesting technical topics.
Jon Krohn: 00:07:17
Chip actually, she has a Sphere course.
Josh Wills: 00:07:20
She does. She teaches, she teaches a course based on, again, her excellent, excellent book Designing Machine Learning Systems. I don’t have a book on my, on my topic, but I’m trying to distill my experience building data infrastructure at Google, at Cloudera, for other companies, and then ultimately at Slack into like sort of four sessions, two hours a piece where we kind of go through like, this is how all of your data ingestion stuff. Like, like what, what should you do from a data engineering perspective when you’re doing machine learning as opposed to when you’re just doing like BI and reporting and stuff like that. Like where, where do you need to step up your game? Like I generally think of like machine learning as like software engineering on hard mode in every way. Like, debugging is harder, monitoring is harder, everything is harder than it is under regular, because just, you know, it’s I describe it as like, especially with like all the, you know, ChatGPT stuff going on right now.
00:08:16
It’s like deliberately injecting randomness and like spaghetti code effectively, like these gigantic vectors into your production systems, like intentionally, like on purpose, right? And so how do you engineer around that? How do you defend against all these things that can go wrong when that happens, and so on and so forth. And so that’s like, that’s basically what I try to teach. It’s hard, there’s like a lot to cover and we only have like eight hours to do it. So we start with like, just the absolute basic foundations around, like data ingestion like how do we schema things? How do we evolve schema over time? How do we be very, how do we be very, very rigorous about you know, like I think the, the terminology I joked the other day we, before ChatGPT, we used to talk a lot about data contracts in the, in the kind of data Twitter sphere.
00:09:07
But like, that’s kind of that idea. How do we essentially guarantee that changes in production don’t break the machine learning pipelines that we use to power other parts of production, that kind of stuff. A lot about feature engineering, feature stores. Like how do you do this sort of, these calculations you have to do scalably, how do you do them fast enough? Again, this is something Chip knows a lot more about than I do, but I do devote a section of the class to it. Data quality monitoring, again, becomes like not just a nice to have, it becomes like absolutely essential when the data is feeding, again, production machine learning systems. And then finally for me, like the most fun stuff, which I always think of as like the highest, best form of data science, is around like experimentation, AB testing things, and then kind of beyond that, contextual bandits.
Jon Krohn: 00:09:51
Yeah.
Josh Wills: 00:09:51
Like, you know, and, you know, like this is one of the areas I’m passionate about. Like a lot of folks just build a, like one-off machine learning model in a notebook and just be like, “Yeah, look at this cool model I built. How, how great is this?” Right? But like, when you’re doing this for real, like for money, you never stop building. You’re constantly iterating on the model. You’re constantly trying to improve it. If you’re, if you’re not gonna do that, like why are you doing it in the first place? It’s, it’s not worth it. Right? And so exploring techniques for like, how do we get better, faster, you know, with less data is, is the final sort of like, focus of the class and stuff. So, yeah. These are, it kills me. These are all things that I could devote, you know, months of, of, of discussion and topic to, and stuff I have to cover in a relatively short period of time. So I’m working with Sphere to like kind of break out the course into like focused sections, basically like each of these sort of like, each of the individual sessions could be a course in and of itself. And I’m like, I’m frustrated that there’s like so much to say and so much to cover and not enough time to do it. So anyway. Yeah.
Jon Krohn: 00:10:51
Let’s specifically dig into that contextual bandits bit.
Josh Wills: 00:10:54
Okay.
Jon Krohn: 00:10:54
I think a lot of people know what AB testing is. Maybe we can use that to like ease in what is like, you know, for me, AB testing is like really obvious when you’re thinking about it in the context of, “Oh, should we show a user interface that’s blue or green and which one’s gonna get people to click on more ads or buy our product” or whatever.
Josh Wills: 00:11:14
Exactly. That’s right. That’s right. Yeah.
Jon Krohn: 00:11:16
But yeah. What’s, what’s that like in the context of machine learning? A kind of AB testing?
Josh Wills: 00:11:20
AB testing is, AB testing, multi-variate testing is, is the gold standard for establishing, you know, like, like whether something is, is sort of meaningfully better than something else, or a meaningful improvement in some sense. There is, however, a cost that goes along with AB testing: the cost is that it takes a long time. It takes a long time to gather enough data to achieve the statistical significance that you’re looking for. And there are a lot of problems where taking that time and making that investment is worthwhile and is definitely something you should do. There’s a lot of things that are like, you know, big deal decisions, right? That are, that are not reversible. For instance, like you know, credit check stuff, are you approving someone for a loan? Like these are, these are big deal things.
00:12:05
These are like, these are heavyweight. And so improving these systems, the AB testing is the way to go. There’s a lot of other problems though, where, like the AB testing approach is kind of overkill. And, and the, the kind of classic one I think is like content recommendations, recommending news articles for people to read, recommending episodes of, of your podcast to listen to. No individual decision is like life or death, right? Like, it’s, it’s not, it’s not the end of the world if someone gets like one bad recommendation out of 10 or something like that. Like the Netflix, the Netflix movie page is kind of in the same sort of boat, right? And so if you’re in that regime, if you’re in a world where like, it’s not a life or death decision, it’s not all or nothing, you can use some techniques including basically intentionally randomly injecting results, randomly injecting, you know, answers in order to much, much, much more rapidly sort of develop an understanding and improvement and figuring out like the policies and the algorithms that will do a better job of recommending in a way that you can’t in kind of other circumstances and stuff like that.
00:13:07
So again, like identifying when AB testing is the right thing to do, and identifying when contextual bandits can, like, I mean, and, and not just like a little faster, like, you know, an order of magnitude or two like 10 times faster, 10 times less data is, is what we’re talking about here with contextual bandits. And so, yeah. And again, it’s, it’s one of those things, like, it’s, it’s interesting to me like contextual bandits are like everywhere at Google and Facebook and stuff, and not generally out in, in sort of society at large. I don’t think, like we don’t see nearly as many of them as I would, as I would expect, given how useful they are. I think and this is just, again, something I kind of wanna cover in the class is like, Hey, this is, this is really, really cool. Very, very useful stuff that is worth, is worth having, like in your toolkit for data, for tackling data science problems. Yeah.
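[Editor's sketch] To make the "it takes a long time" cost of AB testing concrete, here is a minimal Python sketch of a textbook sample-size estimate for comparing two conversion rates. The function name, the 10% versus 11% rates, and the default significance and power levels are illustrative assumptions, not figures from the episode.

```python
import math
from statistics import NormalDist


def samples_per_arm(p_a, p_b, alpha=0.05, power=0.8):
    """Approximate users needed in EACH arm of a two-sided AB test.

    Uses the standard normal-approximation formula for comparing two
    proportions. p_a and p_b are the conversion rates we assume for
    variants A and B (hypothetical inputs chosen by the analyst).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2)
```

Under these assumptions, `samples_per_arm(0.10, 0.11)` comes out to roughly 15,000 users per arm, and halving the difference you want to detect roughly quadruples that number, which is why small improvements take so long to confirm.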
Jon Krohn: 00:13:52
Nice. So to try to summarize that back for you, and I’m probably gonna butcher this, but contextual bandit is kind of like AB testing, but instead of having it be this like, oh, we’re gonna give 50% of users A and 50% of users B, instead, you have a circumstance where like showing people Netflix films, it isn’t a matter of life or death, it’s not like predicting a tumor in terms of accuracy importance. So you can insert some random movies, some random outputs, and those data points tell you, in aggregate, an order of magnitude more quickly how to be optimizing your, your platform or your machine learning decisions.
Josh Wills: 00:14:40
Yeah, exactly. Exactly. You can, you can sort of go back and, and play back the data points and evaluate algorithms based on how well they would’ve, like, how well they would’ve matched the optimal choices. The, the sort of standard way to do this would be to have like, like a current best model that is doing it. And 90% of the time the current best model is allowed to make the recommendation. But then like 10% of the time we essentially randomly choose some items to recommend, and then we evaluate like, how did, what feedback do we get? Like, did the person select the best one, or did they select like the randomly selected one? And if they selected the randomly selected one, that’s really interesting. That tells us there’s something about the, the best model that needs to be improved. And so we can be continuously training on this data.
00:15:22
And again, for like, like news articles is kind of a great example. News articles change all the time. Like the Zeitgeist changes all the time. Like when ChatGPT happens, we wanna like immediately pivot all of our recommended news articles to be about ChatGPT and contextual bandits give us the opportunity to make those very sort of fast adaptive pivots to new data when we see it. That is, again, harder in AB testing where it’s like, I gotta wait a while, I gotta see, make sure the data is, you know, correct and stuff. And so again, when the stakes are lower, making the decision faster has a tremendous amount of benefit anyway. Yeah.
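[Editor's sketch] The 90/10 scheme Josh describes can be sketched in a few lines of Python. This is a minimal illustration rather than anything used at the companies mentioned: the names `LoggedEvent`, `choose`, and `replay_value` are hypothetical, the 10% exploration rate is taken from the example above, and the evaluation is a simple replay-style estimate that only trusts events where the item was picked uniformly at random.

```python
import random
from dataclasses import dataclass

EPSILON = 0.10  # assumed exploration rate, matching "10% of the time" above


@dataclass
class LoggedEvent:
    context: str    # e.g. a user segment (hypothetical)
    action: str     # the item that was actually shown
    explored: bool  # True if the item was picked at random
    reward: float   # 1.0 if the user clicked, else 0.0


def choose(context, items, best_model, rng):
    """Return (item, explored): the current best model's pick most of the
    time, and a uniformly random item the rest of the time."""
    if rng.random() < EPSILON:
        return rng.choice(items), True
    return best_model(context, items), False


def replay_value(policy, log, items):
    """Estimate a candidate policy's average reward by playing back the log.

    Only events where the item was chosen uniformly at random are used,
    and only when the candidate policy would have shown that same item.
    Returns None when no logged event matches.
    """
    matched = [e.reward for e in log
               if e.explored and policy(e.context, items) == e.action]
    return sum(matched) / len(matched) if matched else None
```

A serving loop would call `choose(context, items, best_model, random.Random())`, record the outcome as a `LoggedEvent`, and periodically compare candidate models with `replay_value` before promoting one to be the new best model.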
Jon Krohn: 00:15:55
Are you moving from batch to realtime? Pathway makes realtime machine learning and data processing simple. Run your pipeline in Python or SQL in the same manner as you would for batch processing. With Pathway it will work as is in streaming mode. Pathway will handle all of the data updates for you automatically. The free and source available solution based on a powerful Rust engine ensures consistency at all times. Pathway makes it simple to enrich your data, create and process machine learning features, and draw conclusions quickly. All developers can access the enterprise proven technology for free at pathway.com. Check it out.
00:16:32
Nice. Thanks for filling us in on contextual bandits. Sounds like they are super powerful and people should be checking out your course to learn about them if they aren’t already aware, given how widely useful they could be in machine learning applications. Another really powerful topic in your course is on data quality and monitoring. And you’ve talked about these topics a lot. We’ll link to a few blog posts and talks that you have on it in the show notes. So data quality and monitoring. Where in the pipeline, in the data pipeline or the machine learning pipeline are data profiling and quality checks necessary?
Josh Wills: 00:17:09
Oh. And it’s like, I mean, I think everywhere is kind of a, you know, like terrible answer. I think. I think if you kinda look at the journey of my career, Jon, like, I started out as just as effectively like a data analyst kind of person, and I’ve sort of slowly worked my way down the software stack. And so like eventually I think I’ll become like a chip designer or something, or like, I’ll be, I’ll be like, fabbing, fabbing silicon at the rate I’m going. But for me,
Jon Krohn: 00:17:38
You’ll, you’ll be shaping kernels of sand.
Josh Wills: 00:17:41
I’d say actually that, I mean, actually sounds kind of peaceful, dude, now that you mentioned it actually sounds kinda nice. But the, the sort of the data quality and monitoring stuff was, it was kinda like my main sort of leap from like offline data pipelines sort of model training stuff to online like model serving inference kind of stuff in, in a lot of, in like in the work I do, right? And so I had to learn about how do regular software engineers who are not doing machine learning, how do they monitor and do observability and stuff like that on their software systems? Because I suddenly needed this stuff, that I needed to start understanding how this stuff worked. And so I could take advantage of the stuff that Slack had in order to be able to detect data quality problems. Like yeah, essentially in real time and, and like in real time, and this is something, there’s a lot of MLOps, lots of, you know, data quality monitoring systems and stuff like that.
00:18:35
And really like reaching a point where like, I’m carrying a pager and I’m alerted when like a bad observation or like a search query comes in that like destroys the search index or destroys the, the, the query backend, these are all things that happen, was a big, you know, like inflection point for me in my career where I became, I sort of, you know, like I guess I’ve progressively become less and less of a data scientist and more and more of a software engineer. And that was, that was a pretty big step in that direction. So yeah, I mean, I am like, I think one of the things I learned at Google is that simple analysis on top of high quality data, like usually wins the day, like usually wins the day. And it’s a lot easier to have, like, it turned out to be like at least at Google, like, a lot easier to have really, really high quality data and do simple things with it, as opposed to having kind of meh data or like a little bit of good data and then a bunch of maybe garbage data and then doing some fancy stuff to make it, to sort of improve it or try to do it right, whatever.
00:19:42
That is sort of the ethos that I’ve tried to bring with me everywhere I went from like Cloudera to Slack to whatever. I’ve always been militant about like, we need to have schemas, we need to have quality tests everywhere. We need to know as soon as possible if there’s a data quality issue. Like not in 15 minutes, not tomorrow. We need to know right now. Like, that’s, that’s kind of been my ethos for that reason. And so yeah, like learning how the observability stack worked, from Prometheus to Elasticsearch to Grafana to tracing tools like Jaeger and, and Honeycomb and stuff was, was just a huge, huge, like leap forward for me. I think, I guess like there’s a lot of MLOps solutions out there and there’s, and there’s like some really, really good ones. There’s some really cool stuff.
00:20:24
For me though, like catastrophic machine learning failures are typically data quality problems and like very sort of in some ways kind of prosaic, boring data quality problems: a field that should be there isn’t there. A value is taking on a value, like a field is, a feature is taking on a value it’s never taken on before. Like stuff like, like stuff that’s not like rocket science that you can check for as opposed to things that are more subtle, like model drift. Model drift is a real problem. But like you can often just get around it in a lot of cases by retraining your models all the time or using contextual bandits, for instance, retraining them in real time. And so yeah, that’s, anyway, that’s sort of like why I harp on this stuff. I don’t know. I feel, I mean, I feel like I’ll always, like my son, my son plays basketball now and a lot of it’s like, how do you convince them of the fundamentals?
00:21:12
Like, you know, dribbling, passing, shooting technique stuff, like how do you make the fundamentals interesting? And that’s kind of the same challenge I have in talking about this stuff is like, how do I make this very, very boring, low level fundamental stuff interesting and engaging to people. And that’s kind of been like the, the arc of my career is trying to figure out how do I do that? Like, cuz it is just, it’s just super important and, you know, it’s important, but it’s also just kind of boring and unfun. So like that’s, that’s the trick here anyway.
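[Editor's sketch] The two "prosaic" failures Josh names, a field that should be there but isn't and a feature taking on a value it has never taken before, are cheap to check per record. The Python below is a minimal illustration under assumed names: `check_record`, the `required_fields` list, and the `known_values` mapping are hypothetical and not part of any tool mentioned in the episode.

```python
def check_record(record, required_fields, known_values):
    """Return a list of human-readable problems found in one record.

    record:          dict of field name -> value for a single event
    required_fields: field names that must be present and non-null
    known_values:    dict of field name -> set of values seen so far
    """
    problems = []
    # A field that should be there, isn't there.
    for field in required_fields:
        if record.get(field) is None:
            problems.append(f"missing field: {field}")
    # A feature is taking on a value it has never taken on before.
    for field, seen in known_values.items():
        value = record.get(field)
        if value is not None and value not in seen:
            problems.append(f"unseen value for {field}: {value!r}")
    return problems
```

In a live pipeline, a non-empty result would typically increment a counter or fire an alert in whatever observability stack is already in place, so the problem is known right away rather than the next day.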
Jon Krohn: 00:21:38
Yeah. That I can commiserate without doubt, because my first foray into content creation was deep learning stuff.
Josh Wills: 00:21:47
Okay.
Jon Krohn: 00:21:48
And then once I had finished like a couple dozen hours of video courses on deep learning, and I had written a book on it, I was like, man, you know what my students really need to know more of and I really need to know more of is linear algebra, calculus, probability theory.
Josh Wills: 00:22:04
Totally. Totally.
Jon Krohn: 00:22:06
That’s what I’ve been teaching lately.
Josh Wills: 00:22:08
That’s awesome. That’s, that’s that is fantastic. Again. Yeah. Not, not sexy, not whatever, but man, is it, yeah. Everything builds on top of that. And I don’t know, but I guess I found this especially at, at Google, this is, I, and I guess I wanna tie this in to ChatGPT if you don’t mind, because I mean, like the, the problem is we’ve not talked about ChatGPT enough, right? But like at Google, again, like way back in the day, like 2008 when I started there you’d interview people for like statistician jobs. Like, it was, it was sort of like, we didn’t have the title data scientist yet, like Jeff and DJ hadn’t, hadn’t come along. So we’re trying to interview people and we would always interview around the fundamentals of probability and statistics and stuff like that, really to see who had like a really deep grasp of it.
00:22:52
Because again, at the time, like when you got to Google, the problems were kind of like all flipped around, like in sort of academic statistics and in much of statistics, it’s like, I have a small data set and I’m trying to extract whatever signal I can from this very small data set. Whereas our problems were the exact opposite. We have, we are drowning in data, we have more data than we know what to do with, and our job is not to be confused by it. Like, and so to do that, we kinda had to like, you just approach things from a completely different perspective. And that again, builds on the fundamentals of probability, statistics, linear algebra. Right? And I kind of feel that way about like the, the ChatGPT or the large language model world we’re entering where like for a long time, like the computer, if the computer gave you an answer, it was the right answer like, like basically, right? It, it would either error or it would give you like the answer that you asked for. And now we’re entering a world where like that’s no longer true. The computer will always give you an answer, but it may or may not be true. And so like our, our whole mental model of like how we interact with computers is about to change in the same way that it, it had to like working with very large data sets back in the day.
Jon Krohn: 00:23:54
And makes the testing problems that you were talking about even harder.
Josh Wills: 00:23:58
Exactly. Right. Precisely. Like all this, this great Bing Sydney kind of stuff. Like, oh my God, like. Like how do, how do we even begin to think about this stuff and test for it? And so this is the great challenge for me in my data engineering class is like, how do I adapt for this new world that we’re entering? Like how do I, how do I like start preparing people for like how to work when like yeah, the stakes are gonna become that much higher and the systems are gonna become that much more complicated. Yeah. Good fun though. Good problem to have.
Jon Krohn: 00:24:23
Yeah. Recently, in Episode #655, Keith McCormick and I discussed that all data scientists should consider no-code options because you might find that they enable you to prototype more rapidly. Keith has enjoyed good fortune with the no-code tool KNIME, spelt K-N-I-M-E, and more than 15,000 people have taken his “Intro to Machine Learning with KNIME” LinkedIn Learning course. KNIME is open-source so it’s free to try, and with the link that Keith is providing to SuperDataScience listeners, his KNIME course is free too! All you have to do is follow Keith McCormick on LinkedIn and follow the special hashtag #SDSKeith. The link gives you temporary course access but with plenty of time to finish it. KNIME can save you time so check out the hashtag #SDSKeith on LinkedIn to get started right away.
00:25:12
Another big problem that you’ve talked about a lot related to data quality is integrating diverse perspectives in an organization when you’re making, you know, it’s supposed to be a data-driven organization, so you’ve got people like data scientists, data engineers, you’ve got relatively quantitative business people. And yet these different groups disagree on what quality metrics to test for and monitor. So how do you deal with that and is it part of what you’ve called in a number of talks, the infinite loop of sadness?
Josh Wills: 00:25:46
The infinite loop of sadness, yes. Yes. Is it a part of the infinite loop of sadness? Yes, it is. It is, it is very much to an extent. I think the, the, the idea of, I guess for folks who’ve not heard of the infinite loop of sadness is that the stakeholders that kind of revolve around the data of a company generally report up through different parts of the org. So you might have business users reporting up through some, through finance, like sort of G&A kind of stuff. Some through engineering, some through product, some through customer success, whatever. There’s all these different stakeholders for data. There’s a data science team somewhere that wants to do sort of cool data science stuff, wants to get to like AB testing, experimentation, contextual bandits, machine learning problems. There’s a data engineering team which may or may not be part of the same organization, like could be like at Slack for instance, like analytics reported up through product at first and then reported up through G&A.
00:26:44
But data engineering always reported up through engineering. And so they’re worried about performance, they’re worried about cost, they’re worried, like especially these days like cloud savings, all that kind of stuff. And then like you have the rest of engineering, which is like, you know, trying to like terraform things and monitor things and observe things and keep the basic product up and running and stuff like that. And so, and then they’re like, oh yeah, there’s the data warehouse thing on the side that we also kind of need to like provision some resources for and you know, terraform or whatever, right? And yet, like, yeah, like the artifact of having, or the result of having all these different stakeholders in these different organizations with different objectives, with different goals is, is kind of this mess. That leads from like business people hiring data scientists.
00:27:27
Data scientists say we need data engineers. Data engineers say we need machines. The rest of engineering says, okay, we need a bunch more money to pay for all these machines. The business is confused because they’ve hired all these people and spent all this money and don’t have any answers to, to their questions yet and stuff. And so these, and these things kind of spiral out of control. I don’t have a great answer to this, unfortunately. I’d like to say that I do like, I, it’s, it’s one thing to identify a problem. I guess where I’ve seen it work well, or the best places I’ve seen it work is where data is centralized in a single organization under a single leader who has like the trust and confidence of the rest of the executive team. This doesn’t always happen. In fact, it hardly ever happens. And, and even when it does happen, it’s kind of an unstable situation because once that leader leaves, it tends to be the case that the whole thing kind of falls apart.
Jon Krohn: 00:28:23
And it sounds like somebody like that would be a prime poaching target because there was...
Josh Wills: 00:28:27
Oh, I mean a hundred percent, a hundred percent. I mean, it was, it was sort of the, the joke to me I think was that when I was director of data engineering at Slack, I was essentially being continually recruited to be the head of data at Lyft, the head of data at Discord, the head of data at Snowflake, like more or less continuously. And I kind of thought like, we should all like, get together, like all of us, all of us, various heads of data who are constantly being recruited and like we could just like collaborate more or less. We could just like set out our terms and say like, okay, you go here and do this, and then I’ll go here and do this. And we can just calmly, constantly keep switching between, you know, like the orgs until we fixed everything.
00:29:00
This is, this was my fantasy. I never actually, actually did this, but yeah. I mean, but yeah, you’re exactly right. If, if you are talented enough and skilled enough to, to do that job, and it’s like, it’s like no joke really hard to wrangle all those different perspectives and get everyone sort of rowing in the same direction. It’s not like a skill I have, for instance. But yeah, for the folks who can, it’s, it, they’re tremendously valuable. And I don’t think it’s a coincidence that you’ve seen all those people go on to be tremendously successful founders, investors, all those kinds of things. Yeah. Absolutely. Because it’s, it’s super hard without a doubt.
Jon Krohn: 00:29:31
Yeah, and we’re gonna get into this later in the episode, you know, what it’s like managing these teams at these fast growing companies. We’ve got some stuff, we’ve got topics on that coming up. For now, I’d like to switch gears to note how not only are you renowned within organizations for the thought leadership or literal management that you’ve done in the past in these really well-known organizations, you’re also very well known publicly. So you are Twitter famous for defining what a data scientist is in this very comical tweet that I cite all of the time to define myself. Which is that a data scientist is a person who is better at statistics than software engineers and better at software engineering than statisticians.
Josh Wills: 00:30:21
Statisticians. Yeah. Yeah, exactly.
Jon Krohn: 00:30:22
Yeah. So this was written over 10 years ago. Do you think it’s still true?
Josh Wills: 00:30:26
Oh, I do, I do think it’s still true. I do think it’s still true. I, it’s, it’s funny, I think I hang out a lot with the, the analytics engineering community now these days, like the folks around dbt and stuff like that. And so there’s a wonderful woman in that community named Emilie Schario, who just, just started her own company called Turbine. And she gave a talk last year called Down with Data Science, Down with Data Science. And she’s very, she was a sort of engineering manager, data leader at GitLab and then at Netlify. And she, she does not like, she does not like the data scientist naming or title thing because she feels that the term has become so overloaded as to become kind of meaningless, right? Like telling someone you’re a data scientist doesn’t really seem to tell, tell you anything about like, what exactly does this person do, right?
00:31:18
And I think that’s a fair critique. I think that’s, I think that’s a legitimate thing. I think, I think there are, like, I do think the data scientist title, while very useful and especially useful, like not for nothing, at these like fast growing kind of tech companies I tend to work at where you do need a lot of generalists, you need people who can wear a lot of hats and stuff like that, it definitely does tend to break down. And I think I see, I like to, I’d like to think that over time data scientists kind of like go into like one of three different directions. And so can you, can you like, can you humor me? Can I, like, I haven’t, I haven’t done this before, I haven’t like pitched this.
Jon Krohn: 00:31:52
Not only will I humor you, I’m gonna take detailed notes.
Josh Wills: 00:31:54
Detailed notes. Detailed notes. Okay. So I think one direction, one direction data science can go in is kind of the direction is the direction of, of like going really deep into kind of methodology and stuff like that. And so this is stuff like the folks who do causal inference, contextual bandits, like very advanced A/B testing kind of stuff. The folks who go to Google, Facebook, the Ubers, the Lyfts of the world, like where they have these really, really like legit hard statistical problems and stuff like that. And they go really deep on the stuff. To me, like always experimentation, A/B testing and, and you know, causal inference, contextual bandits, like how do we make decisions better with data has always been like the highest calling of data scientists in some sense. And so, like the Sean Taylors of the world I think of as like the archetype of, of this model.
Jon Krohn: 00:32:40
Yeah, yeah, yeah. You know, we’ve had him on the show.
Josh Wills: 00:32:43
I am not surprised by that. Yes, he’s, he is fantastic.
Jon Krohn: 00:32:46
Yeah. Yeah, he is fantastic. He was in episode number 617 and he very narrowly missed being in the top 10 most popular episodes of 2022. He was the 11th.
Josh Wills: 00:32:55
Yeah, he, I mean, he actually mentioned that to me when I told him I was coming on the show. He just wouldn’t, he was like so broken up about it. He was just, anyway, I don’t think, anyway, he may be broken up about it, I’m not sure. But he didn’t, he didn’t mention.
Jon Krohn: 00:33:07
Well, at the time that we’re recording, he doesn’t know yet. I haven’t, I haven’t, ah, that episode hasn’t been published yet.
Josh Wills: 00:33:12
Haven’t, haven’t broken it to him. Okay. Got it. That’s good. That’s good. So that’s, anyway, that’s one, that is one route. Another route is sort of to become like a very deep domain expert in some, in some problem area. And there’s, there’s a, a data scientist I worked with at Slack named Ravi Menon, who went on to become like the VP of like all of data at Slack at some point. And his thing initially was he was the, he was the expert on the domain of Slack’s pricing model. So like Slack’s, Slack had a usage-based pricing model, like back in the day, which you know, and now there’s like all these product led growth companies who all do this, but at the time it was a very sort of novel and innovative idea. And Ravi became the kind of absolute unquestioned expert on like Slack’s pricing model.
00:34:00
How did it translate into like the core business metrics of MRR, ARR, how did it relate to our enterprise deals? Like, he really just became like, this was like his thing, this was his domain, obviously like super, super important in the company. And again, there are things like this for folks at Google, for folks at Netflix, like for all these different things. There’s an area where like there’s this really, really important, very, very quantitative business domain and like the data scientist owns it and then they, they expand beyond just being like the data scientist for it to being like the leader for it. And that’s, that’s sort of like the next, next path. And then there’s the final path, which is the one I took, which is the like data engineer in denial. And that, that was more or less me, like I was, I had done like a little bit of statistics and stuff like that as part of my, my graduate school stuff and my undergraduate stuff.
00:34:47
But really I was just like a software engineer who had not like discovered software engineering. And to me, honestly, Jon, in terms of like the engineering work I’ve done, like rebuilding Slack’s search indexing pipeline, which is the, the pipeline that powers like the backend indexing system for Slack search is still to me, like the thing I am, I am most proud of, like by far, like that was my, it was so hard and it was, it was like every possible nightmare data pipeline problem you can encounter, we would encounter like a million times basically cuz of the volume and the scale of the data and stuff. And I love that, man. I love that. That was just, that was just such, such a favorite to me. And so that was like my favorite thing I got to do. And so I feel like my whole career was like working my way towards like kind of finding that problem and getting to work on it.
00:35:34
So yeah, those, those are my big three. So data scientist to me is still super useful, especially for generalists, especially like early stage startups, but still like, to me, an unstable kind of profession, not something that you can maybe do like, and, and inevitably I feel like you get, you gravitate towards one of these sort of three different outcomes of like super, super quant researcher, causal inference, experimentation. Domain expert in some sort of quantitative domain. Or finally like just data engineer because back to the theme of like high quality data, simple analysis on it just really solves like a large, large set of problems. So yeah. Yeah. That’s, that’s that’s my answer, man. Yeah.
Jon Krohn: 00:36:12
I usually come in and like recap, you know, on these complex topics, especially if it’s like an enumerated list. Yeah. I love coming in back, back in and being like number one, number two, number three. You just did it.
Josh Wills: 00:36:22
Sorry, I didn’t even think about it, but yes, exactly. That’s right. Yeah.
Jon Krohn: 00:36:25
Stop stepping on my toes, Josh.
Josh Wills: 00:36:26
I know. Sorry Jon, sorry about that. This is your show, not mine. Anyway.
Jon Krohn: 00:36:29
I get to summarize, you introduce, I summarize. Come on.
Josh Wills: 00:36:35
It only seems fair. It only seems fair.
Jon Krohn: 00:36:37
I love it. That was, that’s, I, I, those definitions are great. Makes perfect sense to me and I don’t need to recap it for the audience. It’s actually, it’s perfect. We can just, we can move on.
Josh Wills: 00:36:47
Just leave it there. All great. Yeah.
Jon Krohn: 00:36:48
Cool. We can move on to the next topic. So we’ve, we’ve talked about, we’ve talked about your course and how that relates to data quality. We’ve talked about how you’re famous for coining data scientist, and that’s led to this very interesting discussion of what data scientists really are. Yeah. You know, and how the definition has, it’s too broad and, and I agree with you, you know, you don’t really have a sense of what somebody’s doing.
Josh Wills: 00:37:14
It’s a starting point, but it’s sort of an amorphous thing. It’s a blob. And it can go in any different ways. And I think just you should go into it like thinking that that’s gonna go one of those three ways. Would, would be my, yeah. Anyway. Like, like people go do consulting, right? What the hell? Like, they consult for Bain or McKinsey and like something, what the hell is that? I have no idea. It’s the same kind of, it’s the same kind of job. Yeah.
Jon Krohn: 00:37:34
Well, so now let’s dig into what you have been doing very specifically in, in your past role. So up until recently you worked at WeaveGrid, which leverage machine learning or leverages machine learning to decarbonize transportation. Sounds super interesting.
Josh Wills: 00:37:52
Yeah, absolutely.
Jon Krohn: 00:37:53
How did you, how did you get into that role? What did you do there?
Josh Wills: 00:37:57
Yeah, for sure, for sure. So I got into the role, so I guess I left, I left Slack in November of 2019, and I took some time off, and then of course the pandemic happened and stuff, and we’ll talk about that later. But I had been feeling for any number of years like I should do something to help the climate as best I could. And so I was like, like a lot of other people in tech, I was very focused on like climate and like how can I help? And I think one of the, the main challenges I think a lot of us discover is that there aren’t actually that many ways for us to help. Like, just to be perfectly honest. Like there’s a, like a lot of fantastic work is being done by a lot of scientists, by a lot of policy makers, by a lot of people and stuff like that. But it’s not like there’s some super hard machine learning problem at the core that if we cracked it, like we solve everything, right? Yeah. So we have to, we kinda have to help on the margins, right?
Jon Krohn: 00:38:47
We’ve had a couple of cool episodes about this, it’s a while ago now, but in the spring of 2021. So two years ago we did two episodes in a row number 459 with Vince Pettacio and 461 with Sam Hinton, where both of them talked about this, like how they are using data science or how you could be using data science or machine learning to tackle climate problems. And it’s exactly as you say, it’s not like AI, like finding you know, being able to crack artificial general intelligence where clearly working on data science is going to directly be involved in that. It’s like climate change. It’s on the periphery. It’s not like, oh, great work. You made that model and now nuclear fusion works.
Josh Wills: 00:39:27
I mean, precisely. Exactly. Which, I mean, maybe someone’s doing that, which, which would be great. Like let’s, let’s throw a DeepMind at like stabilizing, you know, the, the magnetic, the, the, the [inaudible 00:39:36] or whatever, however you say that stuff.
Jon Krohn: 00:39:38
You know, you know, I say that…
Josh Wills: 00:39:39
Someone’s doing that.
Jon Krohn: 00:39:40
I make that joke. And literally we had Dr. Brett Tully come on in episode 533, talking about exactly that.
Josh Wills: 00:39:46
Exactly.
Jon Krohn: 00:39:47
He was using data science to simulate aspects of how fusion could work. But yeah, it’s, it’s nevertheless fantastic. It, it’s not, it’s not, there are various kinds of fringe problems, ways we could be using data science and all of those add up. Of course.
Josh Wills: 00:40:02
Yes, they do. They do. I guess I think of it in terms of like, at least with WeaveGrid is like, there will be second order problems as a result of how we approach things. And then the question is like, how do we, how do we address the second order problems? So WeaveGrid’s like focus is addressing a second order problem that arises with the growth of electric vehicles. Which is that we are now going to have, like over the next decade, like a ton more electric vehicles. 85% or so of charging of electric vehicles happens at home. And a sort of proper level two charger, which is like something you need to kind of like fully charge a car from zero to a hundred over the course of an evening, draws about as much electricity as six refrigerators. And so the, the base load that the grid is going to have to support is about to go up significantly. Like we’ll go up significantly over, over the next few years. And the grid is, you know, first of all, like the, the truly the electrical grid is like one of the engineering marvels of the 20th century. Like, and in fact, I literally, there was like some kind of survey or something and people said, yes, actually that was the, that was the greatest human achievement over the, over the 20th century.
Jon Krohn: 00:41:15
Ah, yeah, yeah, yeah.
Josh Wills: 00:41:16
But the electrical grid is not built for a world in which residential in particular, like electrical use is going up, electrical use residential has been going down in America for pretty much my entire life. Like it peaks back in 1980 because everything is getting more efficient. Our refrigerators, our dishwashers, everything is laughably comically insanely more efficient, right? And so EVs are gonna disrupt that. They’re gonna change that. They’re gonna like shake up the system and shake up this gigantic machine that is not especially smart. The grid is very good and very stable, you know, outside of Texas, sorry, I couldn’t help myself, [inaudible 00:41:54] Texas, but nonetheless for, I just my other sort of nerdy issues with, with Texas and their approach to, to the grid stuff. But anyway but yeah, but it’s not a smart system. It’s, it’s, it’s a dumb system. And so it’s, it’s not aware of like, there are cars and those cars can be connected sometimes and not connected sometimes and stuff, right?
00:42:15
So the question then is like, how do we leverage the internet and all this computer technology that is built on top of the grid, by the way, to make the grid smarter? Because the reality is we can like analyze all this data. We’re getting to see where the electric vehicles are, we can anticipate when electric vehicles are going to get home. We can anticipate how much charge they’re going to need when they get there. And if we can anticipate these things, then we can plan for these things, and we can ensure that when they do come home, when they do need charging, the grid keeps working.
00:42:45
We don’t overload the grid, which is like, again, like in some places in San Francisco in particular where we have like an absolute, you know, ton of electric vehicles, Teslas in particular, we have to change out the transformers and the substation infrastructure because like the grid isn’t built to handle like 20 EVs on one block. Like, it’s just not built for that. Right? So this is, this is like what WeaveGrid does. Like this is, this is sort of their business is like they are building the systems that we need to do what is called managed charging. Managed charging is essentially ensuring that for all this charging that’s happening on the grid, we can schedule and optimize all of it so that we don’t ever overload transformers, substations, all that kind of stuff. We keep things under kind of the rated limits for this stuff so that it doesn’t wear out too quickly.
00:43:31
While also simultaneously making sure everyone’s car gets charged and people get where they need to go and all that kinda good stuff. Yeah. That’s the business. It’s, it’s, it’s kind of, I don’t know, man, it’s just, to me it’s just like a nerdy, fun, exciting time. Like these two gigantic monster industries of, of utilities and like, and car manufacturers are like colliding into each other, like, you know, and they’re just absolute beasts. Between the two of them, we at WeaveGrid and a few other folks, it’s these little tiny companies right there trying to like, you know, smooth out the, the integration between these two things. It’s just kind of cool. So, yeah. I built a lot of their infrastructure when I was there. Like I built like data infrastructure obviously, but like the vehicle monitoring systems and the actual, the very first managed charging system, the one WeaveGrid ran for one of our utility partners, I wrote. And that was like, again, kind of, you know, absolute honor, privilege to get to have the opportunity to do that. I was just, anyway, it was like super, super fun. Really, really cool stuff. Yeah.
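To make the managed charging idea above concrete, here is a minimal sketch of one way such a schedule could be computed. It is an illustration only and not WeaveGrid's system: the `Vehicle` fields, the `schedule` function, the 7.2 kW charger rating, the 25 kW transformer limit, and the flat 10 kW household base load are all assumptions invented for the example. It uses hourly slots and a simple greedy rule: in each hour, the cars that leave soonest get first claim on whatever headroom the transformer has left.

```python
from dataclasses import dataclass


@dataclass
class Vehicle:
    name: str
    arrive: int           # first hourly slot in which the car is plugged in
    depart: int           # slot at which the car leaves (exclusive)
    need_kwh: float       # energy the car needs before it leaves
    max_kw: float = 7.2   # assumed level 2 charger rating (hypothetical)


def schedule(vehicles, limit_kw, base_load_kw):
    """Greedy earliest-departure-first allocation over 1-hour slots.

    Returns (plan, unmet): plan maps car name -> kW drawn in each slot,
    unmet maps car name -> kWh that could not be delivered in time.
    """
    n_slots = len(base_load_kw)
    remaining = {v.name: v.need_kwh for v in vehicles}
    plan = {v.name: [0.0] * n_slots for v in vehicles}
    for t in range(n_slots):
        # Capacity left on the transformer after the non-EV household load.
        headroom = limit_kw - base_load_kw[t]
        plugged = [v for v in vehicles
                   if v.arrive <= t < v.depart and remaining[v.name] > 0]
        plugged.sort(key=lambda v: v.depart)  # most urgent car first
        for v in plugged:
            if headroom <= 0:
                break
            # With 1-hour slots, kW drawn equals kWh delivered.
            kw = min(v.max_kw, remaining[v.name], headroom)
            plan[v.name][t] = kw
            remaining[v.name] -= kw
            headroom -= kw
    unmet = {name: r for name, r in remaining.items() if r > 1e-9}
    return plan, unmet


# Hypothetical block: three cars behind one 25 kW transformer, 12 evening hours.
limit_kw = 25.0
base_load = [10.0] * 12
cars = [
    Vehicle("car_a", arrive=0, depart=8, need_kwh=30.0),
    Vehicle("car_b", arrive=0, depart=6, need_kwh=20.0),
    Vehicle("car_c", arrive=1, depart=10, need_kwh=40.0),
]
plan, unmet = schedule(cars, limit_kw, base_load)
for name, kw_by_slot in plan.items():
    print(name, [round(kw, 1) for kw in kw_by_slot])
print("unmet:", unmet)
```

A greedy rule like this is easy to reason about but is not guaranteed to find a feasible plan whenever one exists; a production system would more plausibly use forecasts of arrival times and a proper optimizer, which is beyond what the conversation describes.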
Jon Krohn: 00:44:30
That, that sounds amazing. I love that, it, and this is a recurring theme in your career, and we might as well dig into this right now.
Josh Wills: 00:44:39
Oh, sure.
Jon Krohn: 00:44:40
Is that you have, you’ve had, you’ve dug really, really deep. So whether you were at Cloudera or Google or Slack, you dug really deep into solving those companies’ problems with respect to data, data analytics, machine learning, and then you would kind of come up for air.
Josh Wills: 00:45:03
Yeah.
Jon Krohn: 00:45:03
And find something really meaningful to tackle, whether it’s reading 50 books on decision making, or climate change,
Josh Wills: 00:45:13
Midlife crisis. Yeah. Anyway,
Jon Krohn: 00:45:16
Tackling climate change, tackling the Covid pandemic. We’re gonna get, gonna get to that later on. So you, you, it seems like I don’t know if it’s exactly like interdigitating like flipping back and forth between the two, but this, this idea of, you know, digging into something really big commercially for a while and then finding something more meaningful to contribute to taking these kinds of sabbaticals like you’re taking right now. We’re also gonna get into that later. Cause you’re, you know, what you’re doing in your sabbatical right now, other than your course is fascinating. The kind of the open source work that you’re doing. Yeah. And so, yeah. So I guess, do you want to comment on this approach of taking sabbaticals and is it something that you’d recommend to people in general?
Josh Wills: 00:45:58
Ah, it’s a great question. That’s a great question. It’s something I am extraordinarily privileged to be able to do I think. Like, you know, without a doubt I feel very, very fortunate that, that Google and Cloudera and Slack have afforded me the freedom to do this. Like yeah. And I just, and sometimes Jon, it’s, it kind of feels like the absolute least I could do you know, right? Like the very, like the, yeah. To be, to be so privileged and then like, yeah, contribute like back to humanity, sort of wherever you can. I think of a lot of it. Like honestly, man, it’s just like I, I, you know, I got a kid, he’s seven, right? And the other thing with the pandemic, at least out here in San Francisco was in September of 2020, the sky turned orange because the wildfires were so, you know, you remember that, I’m sure it was big national news thing. Right. And I will never forget getting woken up by my son on that day. Like, he’s like, he, he has a, you know, timer in his bedroom, so he can’t wake us up before seven. And he wakes me up and he’s like, daddy, it’s, it’s seven, but it’s still dark outside. It’s like, oh my God. And, it’s like, you know, living in the postapocalyptic kind of thing.
Jon Krohn: 00:47:05
What, what do you mean, he has a timer in his bedroom? Like locking the door or something?
Josh Wills: 00:47:08
No, it’s not, not locking the door.
Jon Krohn: 00:47:09
He literally can, he literally cannot disturb us until seven o’clock when the timer goes off.
Josh Wills: 00:47:13
No, no. He, he, he absolutely can. He absolutely can. It’s just when the, when the light turns green at seven or whatever, he’s allowed to like come out and like watch TV or whatever. Like, like that kind of thing. That, that is the rule. But yeah, like more than anything, it’s like when I imagine a future in like 10 or 15 years, when he asks me like, “Dad, what were you doing when the sky turned orange?” And I like, wanted to have a good answer. Wanted, wanted to have a good answer. So more than anything, I think it’s like, I don’t know, like, it, it seems like the, the regret minimizing way of living your life is like the, you know, when your kid asks you, what were you doing with your time? Like having a good answer is, is kind of what I’m after, more or less. Yeah. So that’s maybe that, that’s my framework these days. Yeah.
Jon Krohn: 00:47:58
I thought the way that you were gonna go with that was like making a better world for him, but it’s just about avoiding regret.
Josh Wills: 00:48:03
It’s about avoiding it, it’s about avoiding regret.
Jon Krohn: 00:48:05
That’s an honest perspective.
Josh Wills: 00:48:07
Yeah. Well, it’s because like Warren Buffett said something to the effect of like, you know and when I was reading all those, you know, decision making books and stuff, read, read a lot of Warren Buffett and a lot of Charlie Munger, a lot of Berkshire Hathaway stuff is like, like Warren Buffett, like, what’s success for you? It’s like the people who you want to love you, love you back, that’s what you want. That’s, that’s success. And I’m like, that’s a very useful like framing of the problem. That sounds pretty good to me. So yeah. Yeah. Working in that direction. Absolutely.
Jon Krohn: 00:48:35
Cool. So yeah, well now we’ll flip back to just really nice cold, hard commercial data stuff.
Josh Wills: 00:48:43
Awesome. Thank God.
Jon Krohn: 00:48:44
Much more comfortable. So before WeaveGrid, you worked at Slack, we’ve talked about that you know, we’ve alluded to that you were their first director of data engineering.
Josh Wills: 00:48:52
Yes.
Jon Krohn: 00:48:53
And then you discovered in that role that you don’t like management.
Josh Wills: 00:48:58
Not that much. Not that much.
Jon Krohn: 00:49:01
So it sounds like from footage that we found of you speaking, that you essentially demoted yourself.
Josh Wills: 00:49:07
I don’t know. So I don’t know if demoted is the right word. I think in the, in the tech, in the, well, so in the tech industry, we try not to say that like, people harp on this, right. It’s a, it’s a big deal about like, IC and management are…
Jon Krohn: 00:49:18
Independent contributor track versus management track.
Josh Wills: 00:49:21
Yeah. Exactly. Versus the management track or, or like, or lateral moves. They’re not, it’s not a demotion. And I would say like Slack was a place where these kinds of moves of folks going into management and saying, you know what, this is not for me, was very much like tolerated to the point of being encouraged. Like my, like my manager, like my director counterpart who was, he was head of all, like all of our platform engineering stuff. He left management to go back to an IC role, like a couple of like maybe a month or two before I did. You know what I mean? And there were like just a bunch of, bunch of friends I had there who did that same kind of move. Like Slack, I think to its credit, took engineering management very seriously. Like very, very seriously as a practice, as a discipline, as like a thing you were serious about in a way that like, I think a lot of other tech companies don’t necessarily, where it’s like, oh, just it’s a thing, but like, really it’s the, it’s the tech in the building. It’s the important stuff, right? And Slack was not like that.
00:50:21
And that was a great place and a great environment to discover that something was not for you because there was no, like, you couldn’t, like, I feel like at Google for instance, like there’s this sort of notion of a tech lead slash manager where these two roles are kind of intertwined together. Slack does not have that concept. There is a tech lead who is person A and there’s a manager who’s person B, and, and they are counterparts and they work together, but it’s not the same person. And there are costs that go with that. The communication costs obviously go up significantly, but there are a lot of benefits as well. Like, you don’t have this problem. Like I think a lot of sort of abuse and sort of bad practices we see in employment is when you have one person who has essentially like complete power over the control over the career of someone else.
00:51:06
And Slack didn’t have that cuz there was a tech lead and there was a manager and they were both important, right? And they both played a role here. And so anyway, like there, there were virtues to Slack’s, Slack’s approach here. So yes, I guess I was, I was at a culture where like, you know, quote “demoting yourself” or, or moving, moving back to the IC role was not a thing. You know, it was not, it was not a big deal. I was fortunate to be like, not so far into my management career. I talked to a lot of people who are like, who want to do this, basically. Who like, you know, like Josh, I’ve, you know, the joke I made at that, that one of my talks was you know, engineering management is, is a pie eating contest where first prize is more pie.
00:51:46
Like if you just go kind of higher up the management hierarchy, there’s just like, there’s just a lot harder management problems up there. It’s not like, it’s like doesn’t get fun at some point. Right? And so a lot of folks want to leave management and go back to being an IC, but they’re afraid that they can’t, they’re afraid they’ve been doing it for so long that they’re afraid their skills have atrophied and stuff like that. And they might be right. I’m not really sure. I don’t, I don’t know. I just know that I did not do management long enough and I didn’t do it well enough for my technical skills to really atrophy. I kept doing tech stuff pretty much all the time. Right? And as a result it was a pretty easy transition for me. So, I don’t know, it’s not, it’s only a hero thing to the extent to which your identity becomes wrapped up, you know, I am an engineering director, I am a VP of engineering, this is who I am.
00:52:35
Like this is, this is me. And then if your, if your identity like, and your sense of self-worth becomes wrapped up in it, I get why it’s super hard and super scary to leave. It’d be terrifying. I get that. And you know, to tie this to an earlier conversation, I think one of the advantages I felt like I’ve always had as a data scientist or identifying as a data scientist is I’ve never been embarrassed to not know something about software engineering. Like, cuz I don’t know a lot of things about software engineering cuz I know, you know, learned a bunch of stuff about statistics and simultaneously, I’ve never been embarrassed not to know something about statistics. Cuz I’m not a statistician, I’m not a software engineer, I’m a data scientist. Like my job is the intersection of these two things. My job is to be able to like, ask intelligent questions of people on either side of this divide who know a lot more than I do. And like, it’s fine. That doesn’t threaten my identity. Doesn’t mean I’m a bad software engineer or a bad statistician cuz I’m, cuz I’m not that. I’m a data scientist. Right. And that has always been, I think yeah, like figuring out like a sense of identity where like you have the privilege to ask stupid questions to people like is a, is very much a superpower. Like, kinda like being a podcast host I would imagine. Like it’s, you know, anyway. Yeah.
Jon Krohn: 00:53:44
I do get to ask a lot of questions. And I am, I am comfortable not knowing a lot of things. That’s, that’s why your quote, which you just referred to again, your really famous tweet from a decade ago, coining a data scientist, that’s why it resonates with me so much and why I use it all the time.
Josh Wills: 00:54:05
Good, I’m glad. Yeah. That’s awesome. Yeah.
Jon Krohn: 00:54:08
Cool. Well, so that really answers. I had lots of questions for you about engineering teams and independent contributor versus management track, and you answered all of the further questions I had on the topic for you.
Josh Wills: 00:54:21
My favorite, if I can have a line there, my, my friend Keith Adams, who was the chief architect at Slack, just said the nicest thing to me ever. It’s like one of my favorite compliments is “Josh was far too good of an engineer to be wasted in management”. And like, that was again, like heart explosion, kind of like nicest, nicest things people have ever said about me, that’s by far one of my favorites. So yeah. Yeah.
Jon Krohn: 00:54:40
Nice. It’s kind of how somehow the other way around sounds like an insult where you’re like, too, too, look, you’re too good at management. We gotta get you in engineering.
Josh Wills: 00:54:49
Yeah, exactly. I mean, I mean, I mean that’s, there’s some truth to that I think. Exactly. Anyway yeah, there’s, there’s definitely pejorative versions of it, but I think it was a nice counterpart to the sort of notion of demotion, I think is what I wanna say. Is, at least in Keith’s sort of hierarchy, the IC engineer was a higher status role than the, than the manager director role.
Jon Krohn: 00:55:10
Oh yeah, you see that in meetings all the time where you have the IC is often, or the tech lead as you described it, is often in meetings. Clearly the person who understands this problem best and is the person that you should be listening to. And a lot of being a good manager is just listening to that person’s opinions and making sure that their perspectives all are propagated through the organization effectively.
Josh Wills: 00:55:32
Indeed, indeed. And, and then again, and sort of the reverse thing too, making sure that IC, that tech lead has the context on the business and everything else that’s going on and stuff. That is your job as the manager to keep track of and keep a pulse of so that they can be as informed as possible. Right. And I was not good at that latter thing cuz I didn’t really care about anything that was happening in the organization that was not like the data stack. And so I was not the person to be, you know, it just wasn’t the right thing for me to be doing, just that simple. There’s a lot of stuff about, there’s a lot of stuff about managing I love, by the way. I love hiring, I love talking, I love talking to candidates. I love understanding someone’s narrative arc, you know, understanding like their hero’s journey. Like where, where are they? Like what, what do they wanna be when they grow up? How do we help them get there? I love one-on-ones. You know, like, I think every, every sufficiently like senior job is essentially like a therapist more or less. And, and management is a lot of, you know, just listening to people like, you know, discuss their problems and stuff like that. These are all the things I loved. Sprint planning, not really for me. Like just don’t, I just like, yeah. I just, I just wanna kill myself like sitting through and stuff. I just, I can’t do this. I hate this. Yeah.
Jon Krohn: 00:56:40
Yeah, the, the therapy thing for sure. And one-on-ones can be super rewarding.
Josh Wills: 00:56:45
Yep, absolutely.
Jon Krohn: 00:56:45
I have a friend, I won’t even mention where he works, but he works at one of the biggest tech companies in the US. And he, he is an amazing software engineer, grew to a very high level of management and eventually realized that he was basically only a therapist. That was what he did as a job. And I think what helped him realize that is that his partner is also a therapist, is actually a therapist. Like, there’s a license, actually.
Josh Wills: 00:57:15
Real, like real therapist. Exactly. Precisely. Yeah. Totally.
Jon Krohn: 00:57:17
And so it was, I think they were getting home and having these conversations with each other and being like, huh, our day was the same.
Josh Wills: 00:57:22
I have like, I have like CEO, I have like CEO friends who are like, yeah, I’m basically a therapist. That’s more or less like if I was to do it over again, like, the best CEO training I think is to be a therapist.
Jon Krohn: 00:57:35
Yeah. Come into my office, sit down, tell me what’s wrong.
Josh Wills: 00:57:38
Pretty much how can we, pretty much, man that’s, yeah. That’s the job. Exactly. Yeah. And then for customers and for everybody, like, not just, not just your direct reports, not just everyone at your company. Yeah. Everybody, yeah. Is the folks who are good at it are just great listeners. Anyway. Yeah.
Jon Krohn: 00:57:52
Yeah. So great listening, hiring. These are things that you’ve done not only in Slack, but also you led teams at Google. At Google, at. I just, I just wanted to get together Cloudera and Google.
Josh Wills: 00:58:02
Yeah, exactly. I was like, exactly.
Jon Krohn: 00:58:05
Yeah, yeah. So, so yes, you led teams at Google and Cloudera in addition to Slack and at all three companies you were there when they grew massively. So how did you keep up with those pressures? You know, external pressures, internal pressures. How did you ensure that your team met expectations, that kind of thing?
Josh Wills: 00:58:25
Oh, that’s a great question. It’s a great question. I worked super hard. I don’t wanna say I worked all the time. Ah, it’s, it’s still, I mean, like, I wish, you know, I wish there was a great, you know, like I, it was like, oh Jon, it’s my three step guide and you can buy my book to learn how I did it. It wasn’t, I mean, I wasn’t, I had, I had no work-life balance. Whatsoever for like, the better part of a decade. I kind of like reflect on, especially those, those two years of Slack where I was like building the team, building an infrastructure like the first couple years there. And I’m like, I don’t really remember anything Jon. Like, it’s just kind of like a blur to me. Cuz I wasn’t sleeping, I was just like working all the time and I had like a newborn and stuff, and it was just like, it was, it was just so hard.
00:59:08
I don’t understand. I don’t remember how I did it honestly. I love, love, love, love work, work-life balance and stuff. And I love, I love, again, getting to be a dad and like getting to just go to basketball practice and stuff like that. Like, I love, love, love, love this stuff. Simultaneously, I have to be honest, and this is something else I’m, I’m cribbing from Keith Adams as well. I have to be honest about the fact that the only reason people listen to me, or like I’m a, I’m a guest on this podcast show is because of things that I did when I had thrown work-life balance completely out the window and like worked obsessively. I once worked, I once worked 24 hours straight on my birthday, like in my twenties. Like, that was just like how like absolutely obsessive I was. Right. And I like look back at that and I cringe like, oh my God, like, what was I thinking?
00:59:54
Freaking idiot. But yeah, that was, that was what I did. And, you know I mean, I don’t know. I loved it. I think for better or worse, I lived and breathed this stuff, you know, I think I thought when I was at Google, I thought about Google like nonstop for like four years. I mean, when I, when I quit, when I left, I was, my last job at Google, I was working on like building the data infrastructure for what eventually became Google Plus. And I was, I was massively unhappy and like, just super depressed. It was for any number of reasons. And I think it was like, when I finally quit, I remember like, it was the f I was super busy, like quitting, like quitting was very time consuming. Like giving notice and quitting. And then, like, I remember that first night where I was no longer employed by Google, like having a panic attack basically, because like my whole brain had been thinking about Google constantly for four years. And then it was like, that was no longer an option. I was like, I mean, I could think about it all I wanted, but there was nothing I could do about it anymore. Right? There’s nothing left to be done. And like yeah, just that really sort of painful, painful, horrible night. Yeah, man, I just like, like I said, there’s like, there’s no tips or tricks. I did all the unhealthy things you do when you don’t know the tips or tricks. That’s why well,
Jon Krohn: 01:01:07
You, you say no tips or tricks. But I, we did dig one up from a talk you gave at ODSC. You talked about how good decision making needs a lattice of mental models.
Josh Wills: 01:01:19
Yes, that’s true. That I did. And that was, that was part of my like, kind of after all of that stuff, I think like when I started coming up for air a little bit especially like I was, sorry, you know, I was at Google for like four years and then Cloudera for four years and then Slack for four years. And so like this 12 year period where I worked more or less continuously for, for these companies doing this, doing this stuff. And my last year at Slack, I was just super, just deeply and profoundly fried and burned out. Like totally burned out. It was, they were, they were again, very kind to me. And just let me kind of hang out for a year like basically, I don’t think I did much of like useful stuff during that last year, but nonetheless and I started, I got into like the Farnam Street community and like sort of general like decision making kind of stuff.
01:02:08
Like so, you know, Thinking in Bets by Annie Duke and then yeah, a lot of stuff around like Charlie Munger, like Warren, Warren Buffett, like that kind of stuff that I talked about before. And that was what led me down, like, yeah, getting super, super interested in, in like how do we make decisions? How do we make decisions as individuals? How do we make decisions as organizations? What makes for effective data teams? Like why was, why was data so easy at Google and so hard at Slack? Like, like stuff like that. Why was, why did it work at some companies and not others? Like all, all these kinds of things. And so yeah, I think, I think more than anything, what I, what I came to realize, I think when I think for better or worse, I am very, like Google very much shaped me really.
01:02:54
Like, they took me as like a fairly like unmolded, you know, whatever bit of clay. And they turned me into, into what I became. And so I kind of came out of Google a fairly unapologetically data, data, data, experimentation, OKR, all the things kind of, kind of like zealot, basically. That’s what Google turns a lot of people into that. And at Cloudera, I kinda like preached that gospel, you know what I mean? I was like, really it was like Hadoop and MapReduce and then, and then Spark and all that kind of stuff. But really it was about like, we can use data to make better decisions if we use these tools. If we, if we do the stuff, we can make better decisions. And so I was like, I was the prophet of that. And then when I got to Slack I kind of collided into an organization that really did not work that way, that really didn’t, did not use data to make decisions.
01:03:39
It was very much like driven by vision, intuition, design principles, like, like things that were not, like things that Google, I would say like put relatively less emphasis on, I guess would be the nice way to say it, I guess when I was there. And so that was very hard for me. That was like a culture shock to go to Slack and like say and discover that like there was this other way of working that actually like, kind of works pretty well and has like a lot of virtues to it and stuff. And it doesn’t actually have, like, not everything has to be AB tested and stuff like that, right? And so kind of on the, on the other side of that, that was what led me to kind of a much more integrated perspective on decision making where it’s not, we don’t just hang our hat on one thing.
01:04:21
We combine any number of things together in order to come up with the best decision. There’s a, it was, it was funny Jon. There’s a, it was a great it was a, it was a, the book about, what was it called? It was about the Houston Astros, it was the book the, the guy wrote about the Houston Astros and like, their sort of journey to the World Series. And I, and I gave a talk about it. It’s, it’s one of my one of my talks from 2019 is about it’s, it’s called the talk is called How to Play Well with Others. It’s one of my favorite talks. It’s like one of the, you know, one of my favorites. But it was talking about how the Astros, the Houston Astros integrated kind of Moneyball, you know, very metric style stuff with like more qualitative measures and stuff like that.
01:05:00
And that sort of, has become kind of my dream aspiration I think is like, what does an integrated sort of like, design user research, like data science kind of perspective on these problems look like and stuff like that. Like how do, how do we sort of tie these things together to make better decisions as a whole? Is, yeah, is is kind of my great, my, my my great sort of curiosity and obsession right now. It was, it was a bummer that the Astros apparently cheated to win the World Series or something. It kind of like blew a hole in that otherwise awesome narrative. So I’ll, I’ll never forgive them for that, but nonetheless, yeah,
Jon Krohn: 01:05:31
The three pillars of success in the lattice of mental models are a quantitative approach, a qualitative approach and cheating.
Josh Wills: 01:05:38
Yeah, I get it. Exactly. That’s basically it. That’s, I mean, yeah, you need, you need a lattice of mental models precisely. Like exactly, exactly that. Yeah.
Jon Krohn: 01:05:46
Under the box thinking, how can we break this?
Josh Wills: 01:05:48
I mean, precise. Exactly. And absolute like, sort of yeah. Willingness to win no matter what, apparently, I don’t know. Yeah. I don’t know. Things like, [crosstalk 01:05:56]. Yeah. I mean, sort of interesting question here. Yeah, exactly. Like to what extent when does like quantitative or sort of like obsessions with winning kind of lead us down these sort of like morally dark paths, right? Like when is it, and I, like, I, I’m not, you know, super happy with all the things Google has done since I left. You know, I’m not super happy with some of the people who are still there from when I was there. And anyway, things have gotten, things have certainly gotten darker over time as the need for more profit, more money, all that kind of stuff, but so I mean, ’twas ever thus, such is life. Yeah.
Jon Krohn: 01:06:34
I mean they literally dropped the don’t be evil slogan.
Josh Wills: 01:06:37
They did. They did. Except they felt like it was holding them back. Certain, anyway. Yeah.
Jon Krohn: 01:06:45
So we talked earlier in the episode about how you’ve kind of flipped between the commercial kind of work like at Google or Slack. And so let’s flip now back to something you did that I think we can say is unilaterally not evil.
Josh Wills: 01:07:02
Yeah.
Jon Krohn: 01:07:04
Which is that at the start of the Covid Pandemic, you were asked to be part of a team of volunteer experts assembled to build tools for modeling and forecasting the early stages of the Covid-19 pandemic in California. Can you tell us about that experience and what kinds of tools you used, how did, like the problem you’re faced with and how you tackled it?
Josh Wills: 01:07:25
Yeah, yeah. Absolutely. Absolutely. So I think, you know, it, it’s, it was a, it was obviously a scary time, I think for everybody. So it’s like you really kind of gotta, to the extent that anyone wants to transport themselves back to March of 2020 you kinda have to kind of go back in time to when we didn’t know anything and we’re just sort of doing the best we could in the face of a very new and very, very scary disease. You know, sort of took over the world for a bit there. I was unemployed. I had left Slack in, like, November 2019. And so I was just kind of hanging out with not much to do. And my old friend DJ Patil who used to be the chief data scientist at LinkedIn and Chief Data Scientist of the United States, still very like plugged in, networked into governments and all that kind of good stuff, gave me a call and was like, “Hey Josh we’re assembling a team to help out the Department of Public Health in California with their sort of forecasting. We’re trying to figure out, like we’re trying to figure out everything. We’re trying to figure out, like do what sort of lockdown, what sort of like shutdown system do we need? Where do we need ventilators? Where do we need, like, do we need to convert a stadium into a hospital? Like what do we need to do, basically?”
01:08:39
And so they were working with a team of epidemiologists at Johns Hopkins, like basically like postdocs PhD students who honestly had been, you know, living this nightmare for likely a lot longer than I had and had not slept. And had been essentially like running these models, like running these forecasting models, prediction models of like how the pandemic would go continuously for weeks. And our job, like what we were sort of called in to do myself, my friend Sam Shaw, a few other folks who were generally like ex Obama White House people that DJ had known from his time there was to like relieve them and give them a hand and help them out with anything they needed.
01:09:22
Because yeah, it’s like really what we were doing was like, so I was effectively doing, I think, for a brief period of time, I was like every data scientist’s dream, which is that I was your software engineer who was here to do any, anything you needed, anything you needed. I was at your beck and call, I was your, I was your ChatGPT. Just to mention it one more time in this talk, right? I am here, I’m here at your disposal. If you have an optimization problem, I am here to help if you need, you know, and then this is kinda what we did. We took, we took this software they were running on a big machine at Johns Hopkins and we like dockerized it, containerized it, structured it into kind of a proper pipeline.
01:10:04
And we just, again, like literally the, the early stages, Sam and I were just like swiping our own credit cards at AWS to buy the biggest machines we could find to run like just more and more and more versions of these simulations basically as fast as we could so that we would have some idea of what was about to hit us, like what was our, our best understanding of like what was gonna go down. And it was like, yeah, it was a crazy 48 hours from when I started the project to when we like presented our results to the governor and the governor’s staff and stuff like that, like right before he shut down the state of California. And I was like, yeah. I was like, nothing like, like absolutely nothing like it. Like just, you know, and anyway, I was, I was saying about like the decision making stuff was like the midlife crisis and the, that experience that 48 hours was the cure I don’t really ever have to worry about. Like, it was, was my work meaningful, you know, kinda, kinda ever again, it was, it was deeply meaningful to me. Yeah, that was, that was what we did.
Jon Krohn: 01:11:01
Nice. I love that story. I’ve loved this journey through your absolutely fascinating career. So back to the present.
Josh Wills: 01:11:11
Yes.
Jon Krohn: 01:11:12
What kinds of tools do you end up using day-to-day? Like what’s your personal tech stack? Like? You’ve mentioned a lot of different really great tools.
Josh Wills: 01:11:23
[crosstalk 01:11:24] stuff. True. Yeah.
Jon Krohn: 01:11:26
But like, day-to-day, what, what’s kinda like your bread and butter?
Josh Wills: 01:11:28
I’ve gotten into sort of much more sort of small scale stuff these days, so I don’t, I don’t do much in the way of distributed system stuff anymore. I, you know, computers have gotten really good. My, my my latest fascination, which I think a lot of folks who follow me on, on Twitter would know about is, is DuckDB. DuckDB is the, the way it’s described if, you know, it’s, it’s sort of like SQLite, but for analytics, for OLAP. It’s a very, very fast, very small embedded-
Jon Krohn: 01:11:53
OLAP?
Josh Wills: 01:11:54
Oh, oh. So online analytical processing, data warehousing kind of stuff. Like, typically speaking, a database is either row-oriented, like MySQL or Postgres, where data is stored row by row and it’s very fast to do kind of point lookups of a row, or columnar, which is the way like Snowflake and Redshift and BigQuery and all this kind of other stuff work. They all store data in a columnar fashion, where it’s relatively expensive to load it cuz you have to convert all the data to the columnar format, but then it’s very fast to query and do aggregations and stuff like that on that.
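To make the row-oriented versus columnar distinction described here concrete, the following is a minimal sketch in plain Python. It is not how DuckDB, Postgres, or any of the warehouses mentioned actually store data; the toy `rows` data and the helper names `to_columnar`, `point_lookup`, and `total_amount` are hypothetical and exist only for illustration.

```python
# Hypothetical toy data: three orders. All names and values are illustrative.
rows = [
    {"order_id": 1, "customer": "ana", "amount": 30.0},
    {"order_id": 2, "customer": "bo", "amount": 12.5},
    {"order_id": 3, "customer": "ana", "amount": 7.5},
]


def to_columnar(records):
    """Convert a list of row dicts into a dict of column lists.

    This is the up-front conversion cost paid when loading data
    into a column-oriented layout.
    """
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def point_lookup(records, order_id):
    """Row layout: once the record is found, every field is already together."""
    for record in records:
        if record["order_id"] == order_id:
            return record
    return None


def total_amount(columns):
    """Column layout: an aggregation reads one column and ignores the rest."""
    return sum(columns["amount"])


columns = to_columnar(rows)
print(point_lookup(rows, 2))   # the whole row for order 2
print(total_amount(columns))   # sum over the "amount" column only
```

The design point of the sketch is that `point_lookup` benefits from a row keeping all of its fields side by side, while `total_amount` only ever touches the single `amount` list, which is roughly why a columnar layout suits aggregations.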
01:12:24
So SQLite is an embedded OLTP database. It stores stuff by row. DuckDB is an embedded OLAP database. It stores things and kind of operates on data in a columnar fashion, like, like a data frame effectively. Right. So I do a lot of, lot of work around DuckDB mostly in Python. I work on the dbt adapter for DuckDB, which is called dbt-duckdb, which is a lot of, a lot of fun and a great source of joy for me. I’m trying like, Jon to build or kind of rebuild or, or reconceptualize, what does a data warehousing stack look like? Like in a, in a way that takes advantage of all of this like sort of fantastic web server technology that we have developed over the last 20 years or so.
01:13:10
Like, like Postgres was kind of born before the web, right? In some ways, like Postgres has its own protocol, it’s not HTTP and stuff like that, right? It has its own auth system, it has its own all this kind of other stuff. And I’m very curious about the idea of rebuilding this stuff now out of like these individual components and kind of reassembling them together and sort of seeing what I come out with on the other side. I would like to build a radically cheaper, like radically cheaper, like a dollar a month kind of cheaper like data warehousing sort of stack right now, this is, this is my sort of weird obsession. This is not a good business if you’re in a business where like, you know, the stack costs a dollar, it’s, it’s kind of hard to make a living doing it. But it’s something I’m just deeply, deeply curious about and I’m trying to like figure out, so that’s, that’s kind of broadly what I’m up to these days.
Jon Krohn: 01:14:02
It sounds like it could be disruptive. And then there, there are good business models around that where if you can build, so kind of like Wes McKinney, you know, yes. You can build open source. With Pandas and then more narrow with Apache Arrow. Yeah.
Josh Wills: 01:14:16
Arrow. Yes, exactly. That’s right. Totally.
Jon Krohn: 01:14:18
So you build these open source tools that are highly disruptive, but then you can charge consulting fees to help people implement it and take advantage of that.
Josh Wills: 01:14:28
Totally. Exactly. Exactly. Whether or not that’s how I wanna spend my retirement years or not is kind of a, is kind of a question. But for the time being like, this is the problem, I’m like obsessed with solving, so this is what I’m working on. Yeah.
Jon Krohn: 01:14:40
That sounds awesome. And I am sure-
Josh Wills: 01:14:42
At least maybe like figuring out if it’s possible, I guess is probably, I know I’m gonna solve it, but like, can I, can this work? This is again, that’s kind what I’m curious about.
Jon Krohn: 01:14:50
Yeah. Yeah. And even just solving pieces of it could be game changing.
Josh Wills: 01:14:53
If, if anything, it’s like a very interesting problem and it’s just super fun for me. So that’s always been my time these days. Yeah.
Jon Krohn: 01:14:59
So yeah, I’m, I’m sure our guests just as I am, are confident that whatever happens with this project or whatever happens next in your career is going to continue to be extremely interesting. Just like your path has been so far.
Josh Wills: 01:15:11
Oh, thank you for, man. As I said, I’ve been very, very lucky, extraordinarily privileged to have, yeah, to get the chance to do this kind of stuff. And obviously happy to be at the disposal of, of the next major crisis disaster that needs, that needs some, some data pipeline optimization.
Jon Krohn: 01:15:24
Nice. Let’s cheer for another crisis.
Josh Wills: 01:15:26
Yay. That’s not…
Jon Krohn: 01:15:29
No, no, no.
Josh Wills: 01:15:32
Yeah.
Jon Krohn: 01:15:32
So near the end of every episode, I ask for a book recommendation and with all of the amazing topics that we had to cover about you, we didn’t even mention that you have a book. You have a book called Advanced Analytics with PySpark. It was published by O’Reilly in 2022. So it’s a relatively recent book and yeah. You wanna tell us a bit about it?
Josh Wills: 01:15:57
Yeah, it’s, so it’s actually, it’s the third edition. It’s an adaptation of a book that I wrote when I was at Cloudera with my Cloudera data science team with Sean Owen and Sandy Ryza and Uri Laserson back in 2014. Believe it or not, it was the first, it was written for like Spark 1.0, which was like maybe even, maybe even prior to PySpark actually existing. It was, it was done in Scala and stuff like that. And so yeah, it was just a sort of collection of some of our own favorite patterns and libraries and stuff like that for doing, doing advanced analytics with Spark. And then did a second edition of it, made some fixes and some changes and some evolutions. And then a wonderful, wonderful young man named Akash Tandon came along and redid the whole thing in Python. So adapted our stuff to Python, modernized it, and then yeah, like cut a new release of it. But it’s still yeah, still one of, one of my absolute favorite things I’ve done. I tried to was, anyway, there was a lot of things in life, Jon, where’s like, the experience of doing them is awful, but you’re glad that you did it. And like writing a book is, is pretty-
Jon Krohn: 01:16:53
Oh no, I’ve, I’ve written a book too.
Josh Wills: 01:16:55
You’ve written a book. It’s like, yeah, you know. Exactly. Yeah, totally.
Jon Krohn: 01:16:57
It was, it, I, up until that point, the worst experience of my, of my life was my PhD dissertation.
Josh Wills: 01:17:04
Yeah. Okay. That sounds much worse.
Jon Krohn: 01:17:06
The book, the book was worse. A book is worse because you know, or you hope that at least someone’s going to read it, which at least with a PhD dissertation, you don’t have to worry about that.
Josh Wills: 01:17:14
That’s right. Exactly. That that’s this is gonna go on a shelf somewhere and like never think about it again. Exactly. Exactly.
Jon Krohn: 01:17:20
So yeah, it was horrible, but absolutely. It’s, it’s one of those things. Do you ever think about things or have you heard of this like framework of type one fun versus type two fun?
Josh Wills: 01:17:31
No, I don’t know this. What is, I don’t know, this framework, it’s [crosstalk 01:17:34] I should have covered in my decision stuff, but yeah, go ahead. What is it?
Jon Krohn: 01:17:37
Yeah, it’s, it’s really simple and I don’t know if this ever was like officially coined in a book or something, but it’s something that with friends I’ve been talking about for at least 15 years and I remember the first person who mentioned it to me. So it was, it was while I was doing my PhD that somebody explained this concept to me. So type one fun is what you traditionally think of as fun. So it’s like something that directly stimulates dopamine, serotonin production, like partying, driving a car fast, you know, sex.
Josh Wills: 01:18:12
Doing cocaine. Exactly.
Jon Krohn: 01:18:14
Doing cocaine is probably the most direct, very direct type one fun. You’re like, just put it in my veins.
Josh Wills: 01:18:19
I did neuroscience stuff in undergraduate, by the way. Like that was my, I was, I thought I was gonna be a brain scientist back in the day. So I love all the, yeah. Anyway. That’s great. Yeah,
Jon Krohn: 01:18:27
My, my PhD was neuroscience.
Josh Wills: 01:18:29
That’s awesome. Dude, I, we could have done, we could have done a whole section on neuroscience.
Jon Krohn: 01:18:31
We’re gonna have to, [crosstalk 01:18:32]
Josh Wills: 01:18:32
A shame you could have like caught me up on like, what’s like, what’s going on with the basal ganglia, like are we still…? Some other time anyway.
Jon Krohn: 01:18:39
Yeah. I mean, I finished my PhD a while ago, 10 years ago, 11 years ago. So, yeah, I don’t know. I don’t know.
Josh Wills: 01:18:44
I stopped doing the research like 20 years ago. So you’re still like, you’re still up on me anyway.
Jon Krohn: 01:18:48
Yeah, yeah. It was my it was my slippery slope into data and machine learning. Working with big data sets like genetics data, genomics data, really. And brain imaging data. And so there was an abundance of people in our university in Oxford University creating large amounts of data. And there weren’t enough people around to do something to find patterns in the data. And so I was able to get a bunch of papers and, and actually frankly, I don’t know if this is still like a hack that you can do today if you’re doing a PhD in like in, in a lot of scientific fields, but specializing in data science is like a hack because people will spend years literally creating a data set.
Josh Wills: 01:19:38
Yes, yes.
Jon Krohn: 01:19:39
And then they’re like, you know, I don’t really know how to analyze this. They’re like, you know, I think I see this pattern here. Like if I graph it, I can see that like there’s this and that. Or I know from my many years of working with ferrets, with the like electrode implanted in the brain that there’s this particular effect. They’re like-
Josh Wills: 01:19:56
Exactly, exactly, yeah.
Jon Krohn: 01:19:56
Do I like prove this statistically? And I’m like, well so like there are instances from my PhD where I did less than a day of work, but was one of the first authors on a paper because-
Josh Wills: 01:20:09
That’s awesome.
Jon Krohn: 01:20:09
I was able to just be like, well you need to just do this test do this thing.
Josh Wills: 01:20:11
Exactly. It’s like not, it’s not even that hard. Right, exactly. That’s fantastic. That’s, that’s some leverage right there, man. That’s good stuff.
Jon Krohn: 01:20:19
Anyway, how did I get down to this path?
Josh Wills: 01:20:25
The type one, type two. So that was type one [crosstalk], type one type is dopamine and then type two.
Jon Krohn: 01:20:29
Type, type one is direct stimulation, type two fun is this kind of thing like writing a book or exercising where while you’re doing it you’re like, this is-
Josh Wills: 01:20:41
This, this, this sucks, this is horrible. Yes. Precisely. Yeah.
Jon Krohn: 01:20:44
But then when it’s over and you look back on it, you do get that rush of dopamine and you’re like, wow, that was really great.
Josh Wills: 01:20:51
Exactly. That’s right. You feel fantastic about it. Absolutely.
Jon Krohn: 01:20:54
Type one, type two fun. Anyway, so Analytics with PySpark solid type two fun experience. I guess in that case specifically that edition for Akash Tandon, it was probably-
Josh Wills: 01:21:07
Mostly type type two fun. Exactly. For, for me, for me it was relatively light. It’s editing and stuff. Yeah. It wasn’t so bad.
Jon Krohn: 01:21:12
Yeah, exactly. Early, early like 2012, 2013, you were going through it, getting the first edition out. So beyond your book, do you have recommendations of other books for our audience?
Josh Wills: 01:21:26
Oh Chip, Chip Huyen’s book Designing Machine Learning Systems. I literally have it like right next to me, actually pretty clear. I’m actually, see this is my, this is my autograph copy or my other copy. I’m not sure. I have a couple. I saw Chip at a conference recently and I had her sign a copy for me. So yeah. And then another one I just love cuz I’m actually like, you know, a history nerd as much as anything like I love the Robert Caro books, like all of them from The Power Broker, like the Lyndon Johnson series. Rise and Fall of American Growth was absolutely wonderful. Like highly recommended if you, there’s also a, there’s Slouching Towards Utopia, but like the kind of story of like economic growth and the change of an American society from like 1870 to the present is just like absolutely like mind-blowingly fascinating. And I just like thoroughly love those books. There’s another, it is a good it’s an older book. It’s From Dawn to Decadence by Jacques Barzun. It’s a historian who also, I just like, just books where like, just the density of information and discovery and joy I find on each page is just like off the charts high and like Rise and Fall of American Growth is the most recent one I read like that. So yeah.
Jon Krohn: 01:22:31
Nice. You actually answered that question similarly to Chip, where-
Josh Wills: 01:22:35
Oh really?
Jon Krohn: 01:22:35
She reeled off. I mean, all of the books are different.
Josh Wills: 01:22:39
Oh yeah, of course. Sure.
Jon Krohn: 01:22:40
But she just had so many books that she wanted to talk about which was really interesting in filming her episode because she was already over time like we had, you know, she’s the co-founder of a fast-growing tech startup.
Josh Wills: 01:22:52
Yeah, of course. Exactly.
Jon Krohn: 01:22:52
Her time is extremely in demand.
Josh Wills: 01:22:54
Yeah. Yeah.
Jon Krohn: 01:22:55
And so we were over and like I told her before I asked the question, I was like, look, we’ve gone over time so I know we need to wrap this quickly, but I asked this question to most guests, so you know, just gimme a quick answer. And then she spends several minutes going through all these books. She looks back over at her bookshelf, which is over her shoulder and is like, that’s that one. There’s that one and there’s that one.
Josh Wills: 01:23:14
So that, that sounds exactly like Chip and that’s, that’s exactly exactly why I love her. That’s fantastic. Look forward to that episode.
Jon Krohn: 01:23:22
Yeah. At, at the time of recording it isn’t out yet, but yeah, it will have been out by the time people are listening to your episode. So that was 661, 2 weeks before your episode airs 665. So yeah, very last question.
Josh Wills: 01:23:41
Yeah.
Jon Krohn: 01:23:42
How do people follow you after the show? Obviously you are a wealth of interesting information. We’ve talked about your Twitter channel.
Josh Wills: 01:23:52
Yeah. Twitter, Twitter’s Twitter still exists, you know, as long as I kinda got into a fun Twitter exchange today, which is like someone did like the Good Will Hunting, like the Ben Affleck scene at the end analogy with me, which is like, that’s my relationship with Twitter. Like eventually Twitter will die and I will be free, but for the time being Twitter still exists and, and therefore, therefore so do I. So I, I’m, I’m basically like imprisoned there with the rest of the, the crazy people. I think I, my, my request Jon would be like for folks to follow me on GitHub like follow, follow what I’m doing on GitHub, follow me there. So as to motivate me to do the actual work I should be doing and not like shit posting and like, you know, like reply guying on Twitter, which is awfully, awfully fun, but it’s probably not the best use of my time. So yeah, on GitHub, I’m, I’m jwills and that’s where I’m also a gainfully unemployed data person and like I said, hopefully where I’m working on stuff that will maybe someday hopefully fingers crossed, be useful to somebody. So yeah.
Jon Krohn: 01:24:51
Yeah. Nice. I have little doubt. Josh, this has been the fascinating episode that I knew it would be. Thank you so much for coming on the show and we’ll have to check in with you in a couple of years and see how the dollar data warehouse is coming along.
Josh Wills: 01:25:07
That’s a good, I’m gonna steal that, Jon, that’s fantastic. Thank you for that. I appreciate it. The dollar data warehousing menu. Anyway, thank you so much for having me, man. This was tremendous fun. I really appreciate it.
Jon Krohn: 01:25:22
What an honor to be able to learn from a legend like Josh Wills. In today’s episode, Josh filled us in on how machine learning engineering is software engineering on hard mode because scaling, testing, and ensuring reliability are all harder. How contextual bandits randomly insert some results into users’ feeds, allowing optimization of ML algorithms to occur an order of magnitude faster than with A/B testing. How simple analysis on a large amount of data is typically superior to sophisticated analysis on small amounts of data. How the infinite loop of sadness may be avoided through having data centralized under a single trusted leader. How electrical grids can be made smart with data and machine learning to help prevent overload as grids adapt to the strain of electric vehicles charging at home. And how considering or combining concepts from a lattice of mental models can result in better decision-making outcomes.
01:26:10
As always, you can get all the show notes including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Josh’s GitHub and Twitter, as well as my own social media profiles at www.superdatascience.com/665. That’s www.superdatascience.com/665. If you enjoyed this episode, I’d greatly appreciate it if you left a review on your favorite podcasting app or on the SuperDataScience YouTube channel and of course subscribe if you haven’t already. I also encourage you to let me know your thoughts on this episode directly by following me on LinkedIn or Twitter and then tagging me in a post about it. Your feedback is invaluable for helping us shape future episodes of the show. Thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara and Kirill on the SuperDataScience team for producing another scrumptious episode for us today.
01:27:00
For enabling that super team to create this free podcast for you, we are deeply grateful to our sponsors whom I’ve hand-selected as partners because I expect their products to be genuinely of interest to you. Please consider supporting this free show by checking out our sponsors’ links, which you can find in the show notes. And if you yourself are interested in sponsoring an episode, you can get the details on how by making your way to jonkrohn.com/podcast. And thanks of course to you for listening. It’s because you listen that I’m here. Until next time, my friend, keep on rocking it out there and I’m looking forward to enjoying another round of the SuperDataScience Podcast with you very soon.