86 minutes
SDS 585: PyMC for Bayesian Statistics in Python
Tune in to this week's episode to hear Dr. Thomas Wiecki, Core Developer of the PyMC Library and CEO of PyMC Labs deliver a masterclass in Bayesian statistics. Discover why Bayesian statistics can be more powerful and interpretable than any other data modeling approach, and learn more about PyMC and how to build a successful company culture.
About Thomas Wiecki
Dr. Thomas Wiecki is an author of PyMC, the leading platform for statistical data science. To help businesses solve some of their trickiest data science problems, he assembled some of the best Bayesian modelers out there and founded PyMC Labs -- the Bayesian consultancy. He did his PhD at Brown University.
Overview
Calling in from Berlin, Germany, Dr. Thomas Wiecki joins Jon Krohn for a conversation that dives deep into the world of Bayesian statistics.
For those new to statistics, Thomas kicks off the conversation with an explanation of Bayesian statistics, followed by a look at PyMC. Fundamentally, he says that it is the modular, LEGO approach to building data models. "Bayesian statistics for me is a different approach to doing data science," he says. "You build a particular model for a particular purpose." One key benefit Thomas highlights is that Bayesian statistics is very principled about how uncertainty is quantified.
Next, Thomas discusses PyMC, the leading Bayesian modeling library for Python. As one of its core developers, he shares how it evolved, how it trains models so efficiently, and how it automates the sampling of probability distributions and estimates Bayesian models.
Jon and Thomas then discuss case studies of Bayesian stats applied effectively in commercial applications. Thomas also addresses the extra flexibility and modeling advantages of hierarchical Bayesian models and shares his top resources for learning Bayesian stats. And if that wasn't enough, Jon and Thomas also managed to include a few references to the Beatles and Eric Clapton!
Two years ago, Thomas co-founded PyMC Labs to solve commercial problems with advanced large-scale Bayesian data models and has served as the consultancy's CEO since. Thomas also reveals how he built a great company culture at PyMC Labs. While he took notes from his time as VP of Data Science at Quantopian, he says that the book Reinventing Organizations was beneficial for crafting a fluid company that relies on self-organizing principles.
Tune in to this episode to learn about Thomas's work, PyMC Labs, and the tools and approaches that excite him these days.
In this episode you will learn:
- What Bayesian statistics is [7:30]
- Why Bayesian statistics can be more powerful and interpretable than any other data modeling approach [17:20]
- How PyMC was developed [20:41]
- Commercial applications of Bayesian stats [43:07]
- How to build a successful company culture [1:03:14]
- What Thomas looks for when hiring [1:11:13]
- Thomas’s top resources for learning Bayesian stats yourself [1:13:57]
Items mentioned in this podcast:
- Z by HP
- PyMC
- PyMC Labs
- PyScript
- Intuitive Bayes Online Course
- Introducing PyMC Labs: Saving the World with Bayesian Modeling
- MCMC sampling for dummies
- Machine Learning and Statistics: Don't Mind the Gap. By Thomas Wiecki at ODSC Europe 2018
- SDS 507: Bayesian Statistics
- Bayesian Modeling and Computation in Python by Osvaldo A. Martin, Ravin Kumar, Junpeng Lao
- George Harrison: Living in the Material World
- While My Guitar Gently Weeps
- Wonderful Tonight: Pattie Boyd’s love triangle with George Harrison and Eric Clapton
Podcast Transcript
Jon Krohn: 00:00:00
This is episode number 585 with Dr. Thomas Wiecki, core developer of the PyMC Library and CEO of PyMC Labs. Today's episode is brought to you by Z by HP, the workstations for data science.
Jon Krohn: 00:00:18
Welcome to the SuperDataScience podcast, the most listened to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I'm your host, Jon Krohn. Thanks for joining me today. Now, let's make the complex, simple.
Jon Krohn: 00:00:49
Welcome back to the SuperDataScience podcast. Today, we've got the extraordinary data scientist, software developer, and entrepreneur, Thomas Wiecki. Thomas has been a core developer of PyMC, the leading Python library for Bayesian statistics, for over eight years. Two years ago, he co-founded PyMC Labs to solve commercial problems with advanced large-scale Bayesian data models, and he has served as the consultancy's CEO since. Previously, he worked as the VP of Data Science and Head of Research at Quantopian, a fund with an algorithm crowdsourcing model that raised $50 million in venture capital. He holds a PhD in Computational Cognitive Neuroscience from Brown University.
Jon Krohn: 00:01:29
Today's episode is more on the technical side, so it'll appeal primarily to practicing data scientists. In today's episode, Thomas details what Bayesian statistics is, why Bayesian statistics can be more powerful and interpretable than any other data modeling approach, how PyMC, the leading Bayesian modeling library for Python, was developed, and how it trains models so efficiently. He provides case studies of Bayesian stats applied effectively in commercial applications. He talks about the extra flexibility and modeling advantages of hierarchical Bayesian models. He provides his top resources for learning Bayesian stats yourself, and he talks about how to build a successful company culture. All right. You ready for this deep, yet surprisingly fun episode? Let's go.
Jon Krohn: 00:02:16
Thomas, welcome to the SuperDataScience podcast. I'm excited to have you here. Where in the world are you calling in from?
Thomas Wiecki: 00:02:24
Thanks Jon. I'm very happy to be here. I'm calling in from Berlin, Germany.
Jon Krohn: 00:02:29
Yes. What's it like in Berlin as we start to get from spring into summer? I bet there's a lot more outdoor activities, lots of parties.
Thomas Wiecki: 00:02:40
For sure, yeah.
Jon Krohn: 00:02:41
Is it like outdoor drinking culture, like beer gardens and that kind of thing?
Thomas Wiecki: 00:02:47
Yes. A lot of that. It feels like everyone just shuts themselves in during the winter months, which are just gray and rainy and typical... what you would think of as stormy weather. Yeah. Then, when spring comes around, everyone gets excited and goes outside. Yeah. There are a lot of daylight hours, so there's sun and light out deep into the evening. That makes for very good beer garden sessions.
Jon Krohn: 00:03:18
Nice. Yeah. I have actually never been to Berlin, but it seems like a wonderful place to live. I've had a lot of friends go out there and they say that it's just fabulous. You haven't always been in Berlin. You did a PhD at Brown University, which is in Rhode Island in the Northeast US. Then, you were at Quantopian after that. Were you based in New York for Quantopian? Are they in New York?
Thomas Wiecki: 00:03:45
They're based in Boston.
Jon Krohn: 00:03:46
In Boston.
Thomas Wiecki: 00:03:47
But yeah, that's right. I did my undergrad in Tübingen, Germany, which is home to the Max Planck Institute there, where a lot of really cool, deep learning research is being done. That really motivated me to explore more science, and that brought me to Brown in Providence. Yeah. It's a really cool city. I mean, the city is really cool because Brown is amazing, and it's cool also because it's close to New York and Boston, which I really enjoyed. Yeah. Once a week, I would go up to Boston to work at the Quantopian offices and have a good time there.
Jon Krohn: 00:04:27
Yeah. You spent a long time there and we'll get into that a bit later, but I just wanted to give a sense of geographies there. I guess Germany was calling you back and Berlin seems like a great place to be in the world for tech in general, anyway, so cool choice.
Thomas Wiecki: 00:04:43
Yeah.
Jon Krohn: 00:04:43
We know each other...
Thomas Wiecki: 00:04:44
[inaudible 00:04:44].
Jon Krohn: 00:04:44
Oh yeah, I bet. We know each other through Reshama Shaikh, who I've known in New York for nine years. Reshama was instrumental in helping me kick off my public data science speaking career. On my website, if you go to jonkrohn.com/talks and you scroll back to 2013, when I met Reshama, I was doing... Well, at one point, I was doing zero talks; 2011, 2012 is nothing. Then in 2013, I met Reshama when I was giving one of my first public talks on data science ever. Then, she started getting me involved with more data science meetup networks. You'll see, they start to increase exponentially from that year. Now, it's well over 100 talks a year for the last few years. Reshama was key to that initial explosion of opportunities. I'm super grateful to her. She is an invaluable hub in the New York data science community. She is involved in so many different data science and machine learning communities. It's super cool, and it sounds like she now is doing some work for you.
Thomas Wiecki: 00:06:04
That's right. Yeah. As part of the PyMC project, which I'm sure we'll also talk about, which is open source, we did an event with Data Umbrella, her organization, where we had talks and hackathons just to get the community together. That's how I met her. I was just really blown away by her capacity for community building and social media outreach and all of that. I was like, "Well, that is really amazing." That skillset is very rare to find. Right? Someone who really knows open source, knows statistics. Right? She's like a properly trained statistician.
Jon Krohn: 00:06:46
Yeah.
Thomas Wiecki: 00:06:47
But then, also, knows about social media and marketing. I then tried recruiting her for PyMC Labs, the company that I run that is based on the software. I asked, and she said, "No, I'm not interested." But then over time, I sort of wore her down to now work with us and do really cool stuff with PyMC and PyMC Labs. Yeah. She's been invaluable to the community and also to the project and the company.
Jon Krohn: 00:07:16
Awesome. Let's talk about both of those things now. Let's talk about PyMC first. PyMC is an open source library for doing Bayesian statistics. I have used it in the past, but when I used it, it was called PyMC3. Maybe you can give us a quick introduction to what Bayesian statistics is. Then, also tell us about this library, formerly known as PyMC3, now known as PyMC.
Thomas Wiecki: 00:07:42
Sure. Yeah. Bayesian statistics or Bayesian modeling for me is a fundamentally different approach to doing data science, specifically when you compare it to things like machine learning. Machine learning for me is very much... Well, you have one tool, and you just pump in data and out come predictions. Often, that's why it's called black box, because the thing that gets trained, you don't really know what's going on underneath. Right? The predictions come out, and oftentimes they're quite good if you did everything correctly, but if you can't really interrogate the model and understand why it came up with this, it's very hard to really gain insight about your data and learn yourself about the type of things that are going on. That, for me, just disqualifies it for a lot of applied business problems, which is what I care about, because there, when you tell your manager, "Oh, we should be increasing our media spend by 100%. I can't tell you why, but the black box said we should," that often doesn't fly so well.
Thomas Wiecki: 00:08:57
Bayesian modeling instead is what I like to call the Lego approach, where you build a particular model for a particular purpose. You customize it for this thing. It's then hand-tailored for the data science problem that you're working with. Just like with Lego, right? You buy the spaceship set and you can build the spaceship with it, but you can also build a robot or whatever it is that you really want. The building blocks that you're using to build this up are probability distributions. Fundamentally, that is what we're doing: we're plugging things into each other and basically mapping the business problem or the research problem, the process of how the data was generated, into a model, and then this model essentially is capable of generating new data that we might observe. Once you have that, what I'd call the forward model, you can then say, "Okay. Well, now that I've modeled the data generating process, how can I invert that to say, 'Given the data I actually have observed, how can I reason my way backwards to the hidden causes that I actually care about?'" Like, for example, how effective was that media channel in driving sales, which is a thing that I can't observe. Right? I can only observe the downstream effects.
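[Editor's note: the forward-then-invert logic Thomas describes can be sketched in plain NumPy. The conversion-rate scenario below is made up for illustration, assuming a simple binomial data-generating process; it is not from the episode.]

```python
import numpy as np

rng = np.random.default_rng(42)

# Forward model: a hidden cause (say, a channel's true conversion rate)
# generates data we can observe (conversions out of 100 impressions).
true_rate = 0.3
conversions = rng.binomial(n=100, p=true_rate)

# Invert with Bayes' rule on a grid: prior times likelihood, normalized.
grid = np.linspace(0, 1, 1001)
prior = np.ones_like(grid)                              # flat prior
likelihood = grid**conversions * (1 - grid)**(100 - conversions)
posterior = prior * likelihood
posterior /= posterior.sum()

# The posterior concentrates around the hidden cause we never observed.
estimate = grid[np.argmax(posterior)]
```

With a flat prior, the posterior peaks at the observed conversion fraction, i.e. reasoning backwards from data to the unobservable rate.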
Thomas Wiecki: 00:10:24
Bayesian modeling is just really, really good at that. One thing that is often also mentioned as a key benefit is that it is very principled about how uncertainty is quantified. Again, with machine learning, usually you get the best-fitting solution and the most likely answer to your problem. That's cool in some cases, but often, let's say you're in a medical situation, probably you don't just want to say, "Yes, that is cancer or not cancer." You want to have some probability. You want to have some uncertainty, right? If it's based on very little data, probably you're just going to say, "Well, I can't say. I don't have an answer for this," which is a perfectly reasonable thing to do in the absence of data. Bayesian modeling allows you to express these things, because not only are the models built out of these probability distributions, but the outputs of it, the parameter estimates, are also probability distributions.
Jon Krohn: 00:11:29
Right. Right.
Thomas Wiecki: 00:11:31
That is the core principle: we are quantifying uncertainty with probabilities, and that's, for me, the most intuitive way of going about doing that. But really, the other thing that I think makes it so powerful is when you actually go about doing that. Right? So far, what I described is just the theory, and that is hundreds of years old, but it never has been really possible to use this beautiful theory in practice, just because the math is just really, really gnarly when you go about doing it and building your own model. Right? Now, with estimation algorithms, we can't solve it analytically, but we can estimate the solution. We can estimate these probability distributions that we're interested in, that are our answers, using a very general class of algorithms called Markov chain Monte Carlo.
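[Editor's note: the Markov chain Monte Carlo family Thomas mentions can be illustrated with a toy random-walk Metropolis sampler in plain NumPy. PyMC's actual default sampler (NUTS) is far more sophisticated; this sketch only shows the core idea that you can draw from a distribution knowing it just up to a constant.]

```python
import numpy as np

def metropolis(logp, start, n_samples=20000, step=1.0, seed=0):
    """Minimal random-walk Metropolis sampler over one parameter."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    current = start
    current_logp = logp(current)
    for i in range(n_samples):
        proposal = current + rng.normal(0.0, step)
        proposal_logp = logp(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        if np.log(rng.uniform()) < proposal_logp - current_logp:
            current, current_logp = proposal, proposal_logp
        samples[i] = current
    return samples

# Target: an *unnormalized* log-posterior of a Normal(2, 1). The
# normalizing constant is never needed, which is exactly why MCMC works.
draws = metropolis(lambda x: -0.5 * (x - 2.0) ** 2, start=0.0)

# The draws themselves are the answer: a whole distribution, so we get
# a point estimate and its uncertainty together.
mean, sd = draws.mean(), draws.std()
```

The chain's samples approximate the posterior, so summaries like credible intervals fall out directly from `draws`.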
Thomas Wiecki: 00:12:31
Those then allow you to automate the fitting process, the estimation process, which is one piece of the puzzle, but the other piece of the puzzle is, "Well, how do I even build the models?" Right? I just said that the core strength of this is that we can build customized models for a particular data science problem. That's where the power of probabilistic programming comes in, which is what libraries like PyMC or Stan or Pyro give you. They allow you to write, in computer code, in the case of PyMC, it's Python code, a statistical model, where you say, "Well, I have a variable, for example, that tracks the effectiveness of that media channel. I have another one that tracks how that media channel interacts with this other one."
Thomas Wiecki: 00:13:17
You just have these parameters, specify how they interact with each other, and then how they relate to the outputs. Right? That's how the generative process works, from unobservable causes down to data. You write that out like you would a Python program that generates data, right? You just have individual nodes in that graph that interact and generate the data. Once you have coded that, then you just hit what I like to call the inference button, which runs that estimation algorithm, no matter what model you threw at it, and then you get the answers. That, for me, is really the superpower of Bayesian modeling these days: we have really powerful tools that automate this whole workflow and make it iterative, so that we can go in, build a very simple model, see where it's lacking, and improve it, just like writing a program, right? You start simple and you improve it. Now, we can have that very powerful workflow applied to statistical modeling.
Jon Krohn: 00:14:20
This episode of SuperDataScience is brought to you by Z by HP. Get rapid results from your most demanding data sets, train data models, and create data visualizations with Z data science machines, which come in both laptop and desktop workstation options. The data science stack manager on these Z by HP machines provides convenient access to popular tools and updates them automatically, so this helps you customize your environment easily on either Windows or Ubuntu. Find out more at hp.com/datascience. That's hp.com/datascience. All right. Now, back to our show.
Jon Krohn: 00:14:59
Nice. Yeah. We have this field of research, Bayesian statistics, with theoretical concepts that are, as you say, hundreds of years old, but it isn't until the advent of computing in the last few decades that we can make really good use of this theory efficiently, because it's computationally intensive to be doing this kind of probability distribution sampling that you're describing. As we start to add more nodes into our computational graph, it becomes way, way, way more complicated. It wasn't tractable until we had compute in the last few decades, and now, as compute gets cheaper and cheaper and cheaper exponentially, just as it's opened the door for machine learning, it has also opened the door for this very old mathematical approach to become widely useful in a broad range of applications.
Jon Krohn: 00:15:56
I love the way that you describe it as the Lego approach to building a model where you understand what each of these nodes are. You can be quite prescriptive about how information flows between the nodes, as well as the characteristics of each of the components in the model, so you can be relatively prescriptive or unprescriptive about the shape, the center of a distribution, the variance of a distribution. Yeah. You can be relatively agnostic about some aspects of it, or you can be very specific. You could say, "Okay. Based on this previous information that I have that I know from this research paper or from my experience, I have a pretty good idea of how this particular part of the model should behave, so I'm going to seed it with this prior distribution, and then you can use data to make adjustments to that distribution and other distributions in your model." It is really cool, that Lego building block idea of saying like, "Okay. I've got a long yellow block here and I've got like a small red block over here, and these are the pieces that are connected." That's a really cool visual of how we can link together all these different probability distributions.
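[Editor's note: Jon's point about seeding a model with prior knowledge can be made concrete with a conjugate Beta-Binomial update in plain Python; the numbers below are hypothetical.]

```python
# Conjugate Beta-Binomial update: with a Beta(a, b) prior on a rate and
# k successes in n trials, the posterior is Beta(a + k, b + n - k),
# whose mean is (a + k) / (a + b + n).
def posterior_mean(a, b, k, n):
    return (a + k) / (a + b + n)

k, n = 3, 10  # a small, hypothetical experiment

# Agnostic: Beta(1, 1) is flat, so the data dominate the answer.
flat = posterior_mean(1, 1, k, n)           # -> 4/12, about 0.33

# Prescriptive: Beta(30, 70) encodes a strong prior belief that the
# rate is near 0.3 (say, from an earlier study), so ten new data
# points barely move it.
informed = posterior_mean(30, 70, k, n)     # -> 33/110 = 0.30
```

The same data pull a weak prior much further than a strong one, which is exactly the dial between "relatively agnostic" and "very specific" that Jon describes.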
Jon Krohn: 00:17:19
Then, yeah. Something that makes Bayesian statistics... You already alluded to this, but I want to say it again to highlight it for listeners that aren't already aware of Bayesian statistics, the thing that separates it from any other approach that I'm aware of is that the outputs are probability distributions. You don't get a scalar value as an output. You have probability distributions, which means that you can interpret those in a much more nuanced way than just some scalar value, right?
Thomas Wiecki: 00:17:48
Yeah, exactly. That is often where people really start to get it once you're like, "Well, yeah. Why should we just have to make a single prediction when we can make a whole distribution of predictions, according to how plausible each of those outcomes are."
Jon Krohn: 00:18:04
Yeah.
Thomas Wiecki: 00:18:04
Right? That is often a key turning point. I really like what you said about the computational revolution that was underlying this and that continues to be a major driver of these tools. One thing in particular that PyMC has been really big on is using new computational backends. Before, you mentioned that there's PyMC3, and now we renamed it to PyMC. We did that because there's a new version coming out called PyMC 4.0, so we didn't want to have PyMC3 4.0, which sounds [inaudible 00:18:44]. PyMC 4.0 will be the successor, and we've been working on it for a really long time. One of the coolest new features, there are many, but one of the coolest ones, is that you can now take the model that you build and run it on different computational backends. Before, the computational backend would take the model, compile it to C, compile it to machine code to make it fast. But now, we can, for example, also compile to JAX, which is a really awesome new compute library from Google.
Thomas Wiecki: 00:19:20
That leads to incredible speed-ups. On the CPU, we're seeing speed-ups of three to five times, but you can also now take that model and run it on the GPU. There, we see speed-ups of 20x and higher without having to do anything. I think, yeah, there are at least three stages. There's the first one, where it was just completely intractable. Then, other tools came around, like WinBUGS and JAGS, that made it work for small models. But now, the type of models that we are building for clients at PyMC Labs are massive in scale. They have hundreds of thousands of parameters and work on millions of data points. Really, really complicated real-world models. Nonetheless, they estimate in under one hour. That is really... I think something that not everyone has realized yet is that this is not just a useful toy for small data problems with very few parameters. Actually, no, you can build insanely complex models on fairly large data sets.
Jon Krohn: 00:20:32
Nice. We will get to that. We'll talk about PyMC Labs and the consulting that you're doing with these big models, and get into a few use cases. Before we get there, let's dig a bit more into the PyMC library. When people set about doing Bayesian statistics, the PyMC library is clearly one of the most popular choices for getting started and doing it in Python. There are alternatives out there. Maybe it would be helpful to explain what the difference is between PyMC and some of the other libraries. In episode number 507, we had Rob Trangucci on the podcast talking about Bayesian statistics, and he's a core developer on the Stan team. We talked a lot about Stan and the approach to that library in that episode. Maybe you can compare and contrast what makes PyMC different from Stan or other ways of training a Bayesian model?
Thomas Wiecki: 00:21:41
Sure. Yeah. Stan is an amazing package, and they started developing it around the same time as we did with PyMC3. Yeah. A lot of their functionality and tools and inference algorithms, we just copied from them. You find a huge influence from Stan in PyMC3, and they were helping us early on and continue to do so. Yeah. They're awesome, and they are the ones who are pushing this field forward a lot. Yeah. I really like the library. There are definitely a couple of very core differences in terms of technical choices.
Jon Krohn: 00:22:23
Philosophical differences.
Thomas Wiecki: 00:22:24
I think the main one... Yeah. If you want to call it that. One of them is that they initially, I think really targeted R, which is...
Jon Krohn: 00:22:35
Correct.
Thomas Wiecki: 00:22:36
... where a lot of the statisticians still operate, especially the academic ones, and they also followed this, I guess, inherited approach from other systems like WinBUGS and JAGS, which I mentioned before, where you have a specific language that is custom to the probabilistic programming system that you're working with. It's like its own language that you're writing these models in, and that has its benefits in terms of accessibility, but it has its downsides. For example, by now there are things like PyStan and CmdStan that allow you to also run models from Python, and that's fine. But the model code itself would be just a string, right? That you then pass to this system that runs in C++ somewhere else, not in Python. Obviously, it's cool that you can run it from all these different packages, from all these different languages. Yeah. I think for me, I'm not a statistician.
Thomas Wiecki: 00:23:42
I have a coding background, and I always have been coding in Python. The thing that I love about Python is that it's the glue, right? It sticks to everything. Everything, I can do from within Python and in Python. For me, not being able to write my model in Python is a downside because I really want to be able to use the same syntax I use for everything else, for plotting, for data input, outputting, and then for writing a model. That then, of course, translates to other things like just interoperability with the rest of the Python ecosystem where it's really just a library and not a framework. In terms of deployability and those types of things, it just hooks nicer in there. I guess related to that is also that we have built it on top of the Theano library, which was like TensorFlow before it was cool. Well, I guess TensorFlow isn't cool anymore either, but now that the JAX...
Jon Krohn: 00:24:50
The statement is still true.
Thomas Wiecki: 00:24:53
Yeah, exactly. Theano was really so far ahead of its time. It was that first library where you can write these... Well, it focused on deep learning models, and you build up a computational graph. It was John Salvatier, who was the actual originator of PyMC3, who thought, "Oh, wait a second. We have this graph computation library that people have written for deep learning. Couldn't we use that for probabilistic programming?" That was, I think, the absolutely central idea, right? By now, there are other packages like Pyro or JAX that build on these. It's very standard by now, but back then, it was revolutionary. That approach really allowed a lot of other technical innovations, I think, like I just mentioned with PyMC 4.0. Right?
Thomas Wiecki: 00:25:50
If we have PyMC, which is really just 100% pure Python code, but it's not slow because it's building on top of Theano, which is that library that is building up that computational graph. Then, once you have that computational graph representation, which in mathematical terms is just your model evaluation function, you can do all kinds of cool shit with that. You can simplify that and do mathematical simplification. If there's a log of an exp of x, well, you just turn that into an x, or a lot of other cool things. You do these optimizations on the compute graph, and then you take the compute graph and you compile it to C or JAX, which is the new thing. The other thing I want to mention there is that I talked about Theano, which was discontinued, which was actually quite a bummer for us.
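[Editor's note: the kind of graph rewrite Thomas mentions can be illustrated with a toy expression graph in pure Python. This is not Theano's or Aesara's actual API, just a sketch of the log(exp(x)) -> x simplification idea a graph compiler applies before emitting code.]

```python
from dataclasses import dataclass

# A tiny expression graph: variables plus exp and log nodes.
@dataclass
class Var:
    name: str

@dataclass
class Exp:
    arg: object

@dataclass
class Log:
    arg: object

def simplify(node):
    # Rewrite rule: log(exp(x)) collapses to x, applied recursively.
    if isinstance(node, Log) and isinstance(node.arg, Exp):
        return simplify(node.arg.arg)
    if isinstance(node, (Log, Exp)):
        return type(node)(simplify(node.arg))
    return node

expr = Log(Exp(Var("x")))
simplified = simplify(expr)   # the graph collapses to just Var("x")
```

Real systems keep a whole catalog of such rewrites and only then compile the simplified graph to C, JAX, or Numba.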
Jon Krohn: 00:26:44
Yeah. The 1.0 release was also the final release, many years ago.
Thomas Wiecki: 00:26:50
Yeah, exactly. They're like, "We're done with this. It's as good as it's going to get."
Jon Krohn: 00:26:54
Yeah. My understanding of that is that it isn't so much to do with the library being unsuccessful, but it's that so many of the people... Theano was largely a project out of the University of Montreal and all the people that were working on it got hired by Google. Those people were then at Google working on TensorFlow, and so it didn't make sense to be continuing to develop Theano when now, all these people are working on TensorFlow.
Thomas Wiecki: 00:27:20
Yeah. That's exactly what happened. For us, we were like, "Okay. Well, maybe let's explore TensorFlow," and we did. It turns out it wasn't really a good fit for, I think, probabilistic programming in general, because they, just like PyTorch, followed this dynamic graph approach, which by now these libraries do.
Jon Krohn: 00:27:42
Right.
Thomas Wiecki: 00:27:42
That turned out to be a real pain, actually. Theano was different because it didn't allow for this dynamic graph where you run the program and it's created on the fly; rather, you specify it once ahead of time and then you have it. Then, you can do all the stuff that I just mentioned with the simplification and the compilation. That actually is a really cool feature, we found. That has caused us to say, "Actually, Theano is such an amazing system." Yes, it had a lot of technical debt accumulate over the years. But Brandon Willard, who is afraid of nothing, was like, "Okay. Well, I can handle this." He took that library and just completely revamped it, rewrote it, threw out a lot of stuff that had accrued, and added cool new functionality, like the JAX backend or a Numba backend, so he made it a really modern, powerful library that focuses on these types of things. Now, it's called Aesara, which in Greek mythology is the daughter of Theano.
Jon Krohn: 00:28:53
I didn't know that. That's so cool. Asara, A-S-A-R-A?
Thomas Wiecki: 00:28:58
A-E-S-A-R-A.
Jon Krohn: 00:29:01
Cool. Very cool. Just really quickly while we're...
Thomas Wiecki: 00:29:04
Actually 4.0.
Jon Krohn: 00:29:05
Just really quickly while we're doing spellings of things, a number of times you've mentioned Numba, which is N-U-M-B-A.
Thomas Wiecki: 00:29:11
Yeah.
Jon Krohn: 00:29:13
Then, you've also mentioned JAX. I know that there was a Bayesian library, computational library, J-A-G-S, but you're also talking about JAX, right?
Thomas Wiecki: 00:29:27
JAX.
Jon Krohn: 00:29:27
J-A-X.
Thomas Wiecki: 00:29:28
Correct.
Jon Krohn: 00:29:29
This whole time that that you've been talking, you've only been talking about the latter, right. JAX?
Thomas Wiecki: 00:29:34
That is correct.
Jon Krohn: 00:29:35
Okay. Okay. Okay.
Thomas Wiecki: 00:29:36
When I mentioned the historical context of the old languages, that's why I mentioned WinBUGS and JAGS, but every time since then, it's JAX, the Google computational library. Like the new TensorFlow, basically.
Jon Krohn: 00:29:53
Cool. Awesome. All right. We got all that. All right. Then, you were about to talk about PyMC 4.0., and I interrupted you to make sure that I was on track with all the acronyms.
Thomas Wiecki: 00:30:04
All right. Exactly. Yeah. Then, PyMC 4.0, by the time this recording is released, I... Okay. Well, I will just go ahead and say it will be released by then. Now, I have to follow through. We actually have hackathons scheduled just to get it out, because it's been too long. It will be out. That is now based on Aesara. It has all the cool functionality that is now in there, specifically, like I mentioned, the JAX and GPU support, but then it has all kinds of other cool new features as well that we're really excited about. Can't wait to get that out and really continue to push what's possible in probabilistic programming.
Jon Krohn: 00:30:53
Super cool. That was a great orientation to what PyMC is, vis-à-vis its history and relationship to previous and upcoming computational libraries, the state of the art in Bayesian statistics, and graph computation in general, which is cool. While Bayesian statistics and machine learning are different kinds of approaches for working with and modeling data, it's cool to think that so much of the underlying infrastructure can actually be common between the two in terms of solving a computational graph.
Thomas Wiecki: 00:31:31
Yeah. Right. Exactly. We're really standing on the shoulders of giants now because these innovations are so powerful and now through this more flexible middle layer of Aesara, we can just like directly make use of that. Yeah. That's awesome.
Jon Krohn: 00:31:50
Super, super cool. All right. We now have some sense of what Bayesian statistics is. We know about the PyMC library for training our own Bayesian statistical models and being able to draw conclusions from data with those models. At the height of the pandemic in 2020, you were working at Quantopian, which we mentioned earlier, which is a quant finance company. You left them to create your own consultancy. We've mentioned the name of that consultancy earlier in this episode already: PyMC Labs. There's this clear connection between the PyMC library and the PyMC Labs that you're now CEO of. Why was it the right time for you to assemble what you called in a blog post your Bayesian Avengers into this commercial entity to start solving advanced analytical problems with Bayesian statistics?
Thomas Wiecki: 00:32:57
Yeah. I love the introduction you gave. It happened really... Well, two things came together. One was that inside the PyMC development team, we've been programming on this library for a long time, and we have managed to attract a lot of really amazing programmers, statisticians, and data scientists. The barrier to entry in terms of contributing to that library is fairly high, so that attracts the right kind of people who are up for a challenge. We just all really loved working together, and we had these in-person meetings and just started becoming friends, really, who loved working together on stats and modeling. For many years prior to that, we all had this dream. We're like, "Oh well, wouldn't it be awesome to, A, work together, but B, also use these tools to solve applied business problems?"
Thomas Wiecki: 00:33:58
That was really an open question at that time, because... Well, is there even a demand for that? Everyone was just doing machine learning and deep learning. Did people even understand and want that? We didn't know, but after I had left Quantopian, I just started getting inbound interest, with companies contacting me and being like, "Oh, yeah. We heard you're not doing that anymore. We are using PyMC and we could use some help," or, "We heard about Bayesian modeling, heard it's really cool. We think it could be a good fit for a problem. Can you help us with that?" I was like, "Well, sure. Let me first, however, assemble a team, the Bayesian Avengers." Or, as the team likes to call them, the Bayvengers. Yeah. Basically, it was pretty easy from that point on. It was basically like in the Blues Brothers movie, where I just went to everyone and was like, "We're getting the band back together. Let's do this."
Thomas Wiecki: 00:35:07
Yeah. Everyone was just on board immediately. It's really just a couple of the PyMC core developers that I've been working with for a long time. We all just were really excited about doing that for a long time. Then, we now had actually people requesting those services. We're just like, "Well, let's try it out." It turned out for many people in the PyMC core team also to be an opportune time because a lot of them were frustrated with big corporate jobs, so they were more than happy to join this garage band type of company that we were building and just work on really amazing, challenging Bayesian modeling problems together.
Jon Krohn: 00:35:55
Super cool. I love that story. How's it been going? It sounds like there's been a lot of traction, a lot of inbound interest. People who are using the PyMC library saw value in working with the people who could be the best at solving their problems, maybe better than they could themselves. It reminds me a little bit of the show last year, when we had Wes McKinney on in episode 523. Wes McKinney created the Pandas library, which is ubiquitous, but more recently he's been developing the Apache Arrow library, and he created Voltron Data as a company that can then help people use Apache Arrow most effectively. This reminds me of that situation, where you have this amazing open source library, PyMC, you've got the best developers working on it already, and now you can draw on their amazing skillsets to solve commercial problems. Just like in the Voltron Data case, this probably helps PyMC as well, because when you have commercial applications, not only does it broaden the reach of people who are aware of these techniques, but it is also literally bringing funding in that allows you to support the open source project.
Thomas Wiecki: 00:37:35
Yes, definitely. That was really important for us in launching this company: it really should be beneficial to PyMC, because that's really the main thing that we all care about. That has really turned out to be true. Yeah. There have been a couple of important learnings doing this. You already alluded to it, that there's all this inbound interest. The first big surprise was... Well, yeah. A lot of companies, startups, but also really big Fortune 500 companies, are using PyMC. I never knew about them before. That's one of the downsides of open source: you just give it away, so people don't really come back to you, except when it's not working. Now, we have a different channel where... Well, people were still coming to us. They wanted to scale it. It was much wider use than I had ever anticipated. A lot of those companies were looking for experts in using the library to solve a new problem, or a lot of times they had already used it and were like, "Okay. Well, we like this model, but can you make it faster? Can you add a hierarchy here? Can you make it more fancy, more powerful?" It really has taken a much stronger foothold in business and industry than I had originally thought. That's amazing. The other thing is that by using this library in this context, we really see where the library is lacking in many cases. By using it for real world data science problems, we're like, "Okay. Well, really, this type of new technique would be really useful to contribute..."
Jon Krohn: 00:39:25
The shortcoming.
Thomas Wiecki: 00:39:25
Yeah, exactly. Then, we go in and fix that, which is great for everyone. Right? We make the library better for that particular customer, optimizing it for their use case, but also, of course, for everyone else, for the whole community, really pushing the software, the package, to its limits and beyond, because as the core developers, we can improve it ourselves and make it do the thing it needs to do, which might not have been possible before. In that sense, yeah, it has been hugely beneficial to the library, and also, of course, in providing more funding, as you said. Now we're actually able to pay people to work on open source.
Jon Krohn: 00:40:19
Right.
Thomas Wiecki: 00:40:20
Yeah. I like that you mentioned Wes McKinney, who has also always been trying to figure out how open source can get more business support and more funding. There's a lot to explore there and a lot of benefit to uncover. I think we're still at the starting point of that, but there are all these companies, like Anaconda, who are just really helping the ecosystem. I think that needs to be embraced even more, and I can't wait to see more companies forming around open source packages.
Jon Krohn: 00:40:58
Super cool. Yeah. I love it. It's a benefit to everyone. We have more people being paid, the best people, instead of working at a hedge fund. Wes McKinney's a perfect example. He was at Two Sigma, a big hedge fund. Even when he was making Pandas, he was working in finance. To make Pandas, he left his finance job, which he'd had for a couple of years. He'd been living frugally in New York, working for a hedge fund, and was able to save up enough money to then leave work and develop Pandas full-time for a while. Very few people can afford that luxury. That's not a sustainable business model, by definition. Then, he was very fortunate to later be working at Two Sigma, a super successful hedge fund, and it seems like they were happy for him to spend a lot of his time on open source work. But again, that is the exception. It's wonderful to have consultancies like this, and companies like Anaconda, you mentioned, that allow people to make a reasonable living. They don't have to be in a corporate job or at a hedge fund. They can be doing what they love, which is developing these super powerful open source libraries. And then, in real time, as you create updates and upload them to GitHub, everyone can benefit from the amazing work that all of you folks are doing. Yeah. I love this general trend that we're seeing in data science with open source libraries, and I'm glad that you're doing it, and I can't wait to see where it continues to grow.
Thomas Wiecki: 00:42:40
Yeah.
Jon Krohn: 00:42:42
Do you have any specific examples? Earlier, we alluded to this idea of having very big Bayesian models, and that you have some clients at PyMC Labs that are taking advantage of these huge models that, thanks to technologies like JAX, are able to converge on a solution in a relatively short period of time. Maybe you have a couple of interesting case studies from PyMC Labs that you can share with us?
Thomas Wiecki: 00:43:13
Sure. I'd love to. Yeah. One of the case studies I really like is the work that we do with HelloFresh, which is a huge, multimillion-dollar food delivery company. Like almost every other company on the planet, they invest a lot in marketing, right? Because that's how you get customers. I guess, maybe as a general point, this is something that has been really surprising to me in creating PyMC Labs: a lot of our customers actually are from online marketing. As an industry, marketing seems to be at the forefront of embracing Bayesian modeling. I think there are many good reasons for that. One of them is that it's becoming harder and harder in online marketing to actually see what's going on, because everyone's getting more privacy conscious. The death of the cookie: people don't like being tracked.
Thomas Wiecki: 00:44:13
How do you know whether your marketing is working or not if you don't have the information you used to have? That's where a class of models called marketing mix models has been embraced more and more. That is what HelloFresh had already built. There's actually a popular paper from Google that describes a Bayesian version of this media mix model. The problem that it solves is that you have money that you spend on marketing across different channels, right? You have 10 million spend on Google ads, and then five on Facebook ads and Twitter, but also TV and radio and podcasts and influencers. Right? There are all these different channels that you can spend money on. Of course, you want to know how well each one of them is working, but it's very hard to know that exactly.
Thomas Wiecki: 00:45:13
Usually, the only thing you observe is, "Well, this is how much we're spending on that day on that channel, and this is how many users I got in total from all these different things I'm doing." Right? You have many input variables and only one output variable, and you're trying to reverse that, which, at the core, already sounds like a very statistical problem, and it is. It might actually sound like a deceptively simple problem at first. We're like, "Oh well, that sounds like just a linear regression." At the core, that's what it is. But there are so many complexities once you actually get down to it. For example, if you just keep increasing marketing spend on a channel, it's not that you will just get more and more users and it keeps growing linearly. There's a saturation function, right? After a while, users who've seen the ad five times will probably even get turned off at some point; they're just not going to be more likely to sign up or buy a product. This media mix model, MMM, that I mentioned has all these complexities already in there, and HelloFresh had this model built and then came to us to improve it, basically.
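[A minimal sketch of the saturation effect Thomas describes. The Hill-style curve and all parameter values here are illustrative assumptions made up for this aside, not HelloFresh's actual model:]

```python
import numpy as np

def hill_saturation(spend, max_effect=500.0, half_saturation=1_000_000.0):
    """Diminishing-returns response: doubling spend never doubles acquisitions.
    max_effect and half_saturation are hypothetical, for illustration only."""
    return max_effect * spend / (spend + half_saturation)

spend = np.array([0.0, 1e6, 2e6, 4e6, 8e6])  # ad spend on one channel, dollars
users = hill_saturation(spend)
# Users acquired keep growing, but each extra dollar buys fewer than the last,
# and the curve never exceeds max_effect, matching the "turned off" intuition.
```

In a full media mix model, a transform like this (or an adstock-plus-saturation variant) would sit inside the regression for each channel, with its parameters estimated from data rather than fixed.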
Thomas Wiecki: 00:46:32
This is the work that we've done with Luca Fiaschi, and there we came in... This is how a lot of our projects actually go. At first, we're like, "Okay. Well, the model takes 30 minutes to run. Now let's take it apart, blow the dust out, and put it back together." Then, it runs in three minutes. Doing that process, we then learn, "Well, actually, there are a lot of other things that we can do here." For example, these different marketing channels, right? They are not all completely independent. We're spending, let's say, $100 to acquire a single customer on Google. Probably on Facebook, we also have to spend around $100 to get a single customer. I'm just making up numbers, but the point is that the effectiveness of these channels will probably be very similar, and listeners who are already familiar with Bayesian modeling will probably know the solution to that, because it's a very common trick of Bayesians to say, "Well, we can build a hierarchical model now," where we say, "Well, we have each individual channel and we model each individually, but we also actually have a higher-level distribution, which models the similarities between them." Yeah. We model the individual ones and we model the similarities. Let's say nine channels are around $100, and then I have a new channel that I just turned on, and if I were to estimate it independently, it would say it's $1,000. Right?
Thomas Wiecki: 00:48:17
Something completely different. In that case, the model will say, "Well, actually, no. I know that all of these other ones are in the ballpark of $100, so let me just down-regulate that. I'm not going to believe that, because I know it's going to be similar to these." That's one of the things, but then you can go arbitrarily more complex as well. So far, what I described assumes that the effectiveness of these channels is constant over time. Right? The customer acquisition cost will always be $100 on Google no matter what, but obviously that's wrong. During COVID, that threw everything out of whack, and things changed a lot. These things are changing over time. Then, we went in and put time-varying parameters onto these. It's not constant over time anymore; it's actually allowed to change slowly over time. For the expert Bayesian listeners, we use Gaussian processes to model this. Then, yeah, things like COVID you can model, where things go down. But as you can imagine, well, they're probably all going to go down during COVID. These shocks don't just... Well, some shocks might affect just a single channel, but others might affect all of them. Again, that idea of hierarchy is still totally valid.
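[The shrinkage behavior Thomas describes can be sketched with a conjugate Normal-Normal calculation. All numbers below are made up for illustration, as in his example; a real hierarchical model would be fit with PyMC and would estimate the group-level parameters as well:]

```python
import numpy as np

# Noisy per-channel estimates of customer acquisition cost (CAC), in dollars.
# Nine established channels cluster around $100; a brand-new channel, with
# very little data behind it, comes out at $1,000 if estimated on its own.
established = np.array([95.0, 102.0, 98.0, 110.0, 91.0, 105.0, 99.0, 97.0, 103.0])
new_channel_raw = 1000.0

group_mean = established.mean()  # center of the higher-level distribution
tau = 50.0     # assumed spread of true CACs across channels (prior sd)
sigma = 400.0  # assumed noise in the new channel's raw estimate (little data)

# Posterior mean of the new channel's true CAC (Normal-Normal conjugacy):
# a precision-weighted average of the raw estimate and the group mean.
w_data = 1.0 / sigma**2
w_prior = 1.0 / tau**2
shrunk = (w_data * new_channel_raw + w_prior * group_mean) / (w_data + w_prior)
# The $1,000 outlier gets pulled most of the way back toward the ~$100 group.
```

The weights are the key design choice: the noisier the channel's own estimate (large sigma), the more the group-level distribution dominates, which is exactly the "I'm not going to believe that" effect.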
Thomas Wiecki: 00:49:43
Now, we don't just have all these individual media channels modeled as time series, but we also have a hierarchy on that and allow the similarities between the changes over time to also be modeled by a global process that is itself changing over time. You can see it might get confusing at this point, but that's the point: these models can become really, really complex and large-scale, and model all these intricacies that you have in your data. But nonetheless, we sample this model on the GPU in a couple of minutes. Yeah. Once it's built and structured in a way where it's amenable to sampling, which isn't always easy. Right? It's easy to build a model, but not easy to build a correct model that actually works. But once we have done that, you can just scale it. That's a model with tens of thousands of parameters and a whole lot of data. That's one case study. There are other ones. Yeah.
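[The time-varying piece can be sketched as a draw from a Gaussian process prior over one channel's acquisition cost. The squared-exponential kernel and all scales here are illustrative assumptions; the actual PyMC Labs model is far more elaborate and hierarchical:]

```python
import numpy as np

def rbf_kernel(t, lengthscale=8.0, amplitude=15.0):
    """Squared-exponential covariance: nearby weeks have similar CAC."""
    d = t[:, None] - t[None, :]
    return amplitude**2 * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(42)
weeks = np.arange(52.0)                      # one year of weekly observations
K = rbf_kernel(weeks) + 1e-4 * np.eye(52)    # jitter for numerical stability
baseline_cac = 100.0                         # dollars per acquired customer
# One smooth trajectory of CAC over the year, instead of a single constant:
cac_path = baseline_cac + np.linalg.cholesky(K) @ rng.standard_normal(52)
```

The lengthscale controls the "slowly over time" part: with a lengthscale of several weeks, the cost can drift and absorb shocks like COVID, but cannot jump wildly from one week to the next.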
Jon Krohn: 00:50:55
Super cool example, Thomas. I love that HelloFresh example. You worked in hierarchical modeling, which is something that I wanted to make sure we talked about in the episode, and now we've done it. You managed to talk about working with very large-scale data. Just in general, this was a very elegant application. This media mix model is a really cool application of Bayesian stats. I think it gives listeners a great flavor of what's possible. Going back to your idea of these LEGO building blocks, having this level of control over the hierarchy and the relationships between various aspects of your data is something that is unparalleled relative to other kinds of modeling approaches. Excellent example. Now, something that I can't believe we haven't talked about yet in this episode is the MC. We have the PyMC library and PyMC Labs. The Py is Python, but the MC, we haven't talked about that. Your blog is called While My MCMC Gently Samples, in a reference to the Eric Clapton song.
Thomas Wiecki: 00:52:14
George Harrison, I think.
Jon Krohn: 00:52:19
Oh, did Clapton cover it? It's one way or the other. They're both famous for playing it.
Thomas Wiecki: 00:52:24
Okay.
Jon Krohn: 00:52:25
I think Clapton may have written it and he has a version, but then George Harrison has a version as well. Also, the two of them had a complex love triangle with a woman whose name I forget.
Thomas Wiecki: 00:52:39
Oh, right. Yeah. I have no idea.
Jon Krohn: 00:52:43
Yeah. They were really tight. If I'm remembering all this correctly, and I'm stretching, there are probably listeners out there who are shuddering because they really know The Beatles well, or Clapton well, or something. Right? My memory of this is that it was George Harrison's partner, and Eric Clapton was friends with George Harrison, and this woman ended up leaving George for Eric, but George Harrison is this... He did a lot of meditation. He was a very spiritual guy. My understanding of that situation was that he really accepted it. I think he then still got along well with Eric and his ex-partner. Most of this, I'm pulling from... There's a Martin Scorsese documentary about George Harrison called...
Thomas Wiecki: 00:53:42
No way.
Jon Krohn: 00:53:43
I think it's All Things Must Pass, or something like that. No, that's the name of a George Harrison album. I can't remember the name of the Martin Scorsese movie, but there is one, and I'll find it and put it in the show notes.
Thomas Wiecki: 00:53:54
Nice.
Jon Krohn: 00:53:54
There's a Martin Scorsese film about George Harrison that is really good. I watched it 10 years ago when it came out and everything that I'm telling you, I'm trying to remember from that. Hopefully, I got at least the broad story.
Thomas Wiecki: 00:54:07
Yeah. Super Data Science podcast, the host for Data Science and Beatles trivia.
Jon Krohn: 00:54:16
We've got all sorts here, man. Yeah. While My Guitar Gently Weeps is the song, by either Harrison or Clapton or somebody else, but both of them played it, and you have a blog called While My MCMC Gently Samples, which has lots of fun articles that listeners should check out. We'll be sure to include a link to your blog in the show notes as well. The MCMC in While My MCMC Gently Samples is related, I can only imagine, to the MC in PyMC, and it stands for Markov chain Monte Carlo, which is a class of algorithms that sample from probability distributions. We've talked a lot in this episode about how Bayesian statistics makes use of probability distributions, and we need to be able to sample from them. These MCMC algorithms allow us to sample from these probability distributions and train our Bayesian statistical models. Why do you love Markov chain Monte Carlo so much? I can guess it's related to Bayesian stats, but you might even be aware of applications beyond Bayesian stats that data scientists should be aware of MCMC for?
Thomas Wiecki: 00:55:36
Sure. Yeah. I once read a guide on how to prepare for interviews and this type of questions. I actually never had a technical interview in my life, but when I read that, one of the questions was like, "Well, describe your favorite algorithm," which is what the interviewer would ask in the interview.
Jon Krohn: 00:55:54
Oh, yeah. Right.
Thomas Wiecki: 00:55:55
Well, I was like, "Oh, that's a great question." I knew immediately what my answer would be, and that would be the MCMC algorithm. Yeah. I just love it, because it's so elegant in a way where... Okay. The problem that it solves is that you have this probability distribution, and it can be whatever probability distribution. It's not tied to Bayesian modeling, but in Bayesian modeling it often occurs, because that posterior distribution... The holy grail, the thing that we're after in Bayesian modeling, the thing that gives us our uncertainty estimates, is often intractable. Right? Again, as we said before, it's really nice theory, but analytically, it just doesn't work, so we need that inference button, and that inference button is MCMC. It takes this problem that is analytically completely intractable, and then you say, "Well, if I can't solve something analytically, maybe I can sample from it." Right? Then, if I have samples, I can approximate it. Just do a histogram, and that will look similar to the solution that I can't get directly. Then, you're like, "Okay. Well, let's try and sample," but that turns out to be even harder. Right? If you can't analytically evaluate a function, it turns out you usually also can't sample from it in the standard ways.
Thomas Wiecki: 00:57:28
Then, it's like, "Well, what if instead we were to have a Markov chain that has the target distribution as its stationary distribution?" I'm fully aware that what I just said sounds completely insane, and it does sound insane. I have this distribution, I can't solve it, I can't sample from it directly, and now you're telling me I have to do this crazy thing with a Markov chain. That sounds even harder. How's that ever going to work? But the amazing thing is that it is actually super simple. You just have this algorithm, and I write about this in my blog post, MCMC Sampling for Dummies, where basically I start out with the mathematical description that everyone gives, with the stationary distribution and the target distribution, which no one understands. Mathematicians just like to give it to you straight. Then, I spend many pages explaining what that actually means in more intuitive terms. It's by far my most successful blog post, which I was quite surprised by.
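[The algorithm Thomas is describing really does fit in a few lines. Here is a bare-bones random-walk Metropolis sampler for an unnormalized density, written for this transcript as an illustration; the standard normal target is a toy chosen only so the result is easy to check, and the step size and sample counts are arbitrary:]

```python
import numpy as np

def metropolis(log_unnorm_density, n_samples=50_000, start=0.0, step=1.0, seed=0):
    """Random-walk Metropolis: the chain's stationary distribution is the target."""
    rng = np.random.default_rng(seed)
    x = start
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = x + step * rng.standard_normal()
        # Accept with probability min(1, p(proposal)/p(x)). The normalizing
        # constant cancels in the ratio, which is exactly why an unnormalized
        # (intractable-looking) density is enough.
        if np.log(rng.uniform()) < log_unnorm_density(proposal) - log_unnorm_density(x):
            x = proposal
        samples[i] = x
    return samples

# Toy target: a standard normal, known only up to a constant.
draws = metropolis(lambda x: -0.5 * x**2)[5_000:]  # drop burn-in
# A histogram of `draws` approximates the target we never normalized.
```

Samplers like NUTS that PyMC uses by default add gradient information to propose much smarter moves, but the accept/reject core is this same idea.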
Jon Krohn: 00:58:30
[inaudible 00:58:30].
Thomas Wiecki: 00:58:30
Yeah. Yeah, exactly. It's this thing that does something that is crazy difficult and sounds insane, but actually solves the problem in a very general way with a very simple method. It has all these applications. That is the thing: usually, when you have your Bayesian model, the sampling algorithm that people develop is tied to that model. Economics, that whole field, they still love to do that. A lot of their papers are like, "Here's a very simple model. Let me now have all these very complex mathematical derivations for deriving a custom sampling algorithm for this particular model." Now, of course, if you wanted to do that, that would be a really cumbersome workflow. Right? You have your cool model. Now, I want to run it. Now, I have to go to the blackboard and derive all kinds of equations to build my custom sampler. But these types of Markov chain Monte Carlo algorithms, like Hamiltonian Monte Carlo or NUTS, just work on pretty much whatever model you throw at them. Yeah. It's one sampler to solve all of your problems. That's why I love it.
Jon Krohn: 00:59:56
Nice. Yeah. We talk about those a lot. The No-U-Turn Sampler, NUTS, we talk about that a lot in episode 507 with Rob Trangucci, if people want to hear a ton about that. Yeah. Thanks for explaining your love for MCMC. We'll be sure to include a link to your blog, so that people can check out that amazing blog post on Markov chain Monte Carlo. Also, while you were speaking, I was multitasking and I dug into a number of the things we were talking about related to The Beatles and that song. The amazing Martin Scorsese documentary is called Living in the Material World. George Harrison: Living in the Material World is the name of the documentary, and it has 86% on Rotten Tomatoes. I loved it, but be prepared for a long viewing, because it's three and a half hours long.
Thomas Wiecki: 01:00:53
Jesus.
Jon Krohn: 01:00:54
Yeah. I broke it into two parts, but really stunning documentary and goes into a great history of a super interesting Beatle, my favorite Beatle. His songs are some of my favorites and you were right. George Harrison did write While My Guitar Gently Weeps. However, interestingly, in the recording of it on the White Album by The Beatles, Eric Clapton is playing lead guitar, but he's not credited for it.
Thomas Wiecki: 01:01:26
No way.
Jon Krohn: 01:01:27
Yeah.
Thomas Wiecki: 01:01:30
Why would he do that? Why would he have Eric Clapton [inaudible 01:01:33] and not credit him?
Jon Krohn: 01:01:35
Well, actually I have the answer for you right here, which is that, because by that point, there was a lot of infighting between The Beatles.
Thomas Wiecki: 01:01:44
All right.
Jon Krohn: 01:01:48
Lennon and McCartney didn't like the song. They didn't want it on the album. They were apathetic about the song. By bringing Clapton along to record, it helped get it onto the album, I guess. Then, yeah. The woman who was in the love triangle with Clapton and Harrison is named Pattie Boyd. Many of Eric Clapton's most famous songs are about Pattie Boyd, including Wonderful Tonight, Layla, and Bell Bottom Blues. They're all about Pattie Boyd. While Eric Clapton was very indiscreetly interested in her, apparently she was never actually with Eric Clapton until she left George Harrison, due to him apparently having a lot of drug use and infidelity, including, apparently according to this article, with Ringo Starr's wife.
Thomas Wiecki: 01:02:48
Oh, man. This goes deep.
Jon Krohn: 01:02:50
Yeah. There you go. It's what everyone wanted [inaudible 01:02:55] this episode.
Thomas Wiecki: 01:02:56
Now, I'm going to watch. Yeah, exactly. That's awesome.
Jon Krohn: 01:03:00
Now that I have completely segued us away from the technical stuff, it gives us actually a good opportunity to talk about business aspects of what you do, which I think are really interesting anyway. I'd be interested in hearing from you how you have built a successful company culture. It isn't that long ago. It's less than two years since you created PyMC Labs and the company seems to be doing well. You've got a lot of really talented developers and statisticians working with you. How did you build this?
Thomas Wiecki: 01:03:39
Yeah. Thanks for that question. It has been really interesting for me to do this, because I don't have an MBA. I didn't study business. I have some startup experience from Quantopian, and I always loved the culture there. It was just very open and collaborative, and they were also really proponents of open source. A lot of that I copied, but then also extended in various ways. Early on, I did some research on different organizational principles and found some really interesting resources. One of the books that I thought was really cool was Reinventing Organizations by Frederic Laloux. He talks a lot about company culture and the differences between different styles. For me, it really opened my mind: not everything has to have a strict hierarchy of interns and junior and senior developers and then project managers and VPs, and all of this.
Thomas Wiecki: 01:04:49
But then, you could also relax these assumptions and have no fixed hierarchies at all, but fluid hierarchies, and just rely more on self-organizing principles. That is really what I have been trying to build with PyMC Labs. Now, a lot of cards are stacked in our favor, because we had already been working together and had already established a certain style of working together from open source. I knew that it was going to be very important to try and maintain that style of working together from open source, which is also much more driven by being self-motivated, right? You can't tell people in open source what to do. They're usually not getting paid for it. Now they are, but for the most part, before PyMC Labs, they weren't. It really relies on people being motivated to do certain things.
Thomas Wiecki: 01:05:52
It's not that you can direct that, right? You can say, "Oh, yeah, these are the things that we need to do." Then, you can ask people and be like, "Hey, isn't this something that is cool and exciting?" You're trying to nerd-snipe someone into thinking something is cool and then working on it. That might sound inefficient for a company, but it turns out that that's not the case. I think it's just a very powerful way of running an organization: not having very strict project management and tasks, to-do lists and KPIs and OKRs, and all of this, where you measure everything and people need to do stuff that looks amazing because they want to hunt for the next promotion or the bonus. I think a lot of that is probably well intended, but not really effective. Or maybe it is effective in some companies in some respects, but I think it has severe downsides. One of them is that I think it makes people miserable a lot of the time. Once you have infighting, like, "I want to get that promotion, so you shouldn't get it," right, I'm definitely probably not going to be very collaborative in that setting, which is good for me, but not good for the company. Why would you want to incentivize that type of behavior?
Jon Krohn: 01:07:19
Right.
Thomas Wiecki: 01:07:24
Yeah. It was clear to me that I wanted to do something different, in that very same open source spirit, where people just do things because they're really motivated to do them, and they're incentivized to do the right thing for everyone involved, for the whole company. There are a couple of things, actually, that we have built into the structure that are supportive of that. For a lot of that, I have to thank Tom Vladeck, who also runs a very cool, successful company. He basically proposed this idea of having a structure, and I don't need to go into the details, but it is transparent: everyone knows how much everyone else is making. There are bonus payments, but the bonus payments are also calculated according to a formula. Basically, the more you have contributed to a certain thing in terms of time, the larger your percentage of the bonus is. That way, there really is no incentive to try and appear as if you're doing really important, good work. There's just no benefit to looking good. There is only benefit to doing good, because that reflects well on the company. It makes the client happy, which is of course what we want. It makes everyone else in the company happy if you're a good team player, and it increases the revenue for the company, which increases the pay for that person and everyone else.
Thomas Wiecki: 01:09:08
Yeah. These types of levers are really powerful, I think, and have this more communal effect, where we all just have each other's backs and support each other. I found that to be so critical, because everyone also goes through stuff in their life. Right? Some people just have regular life stuff happen to them, have crises, and they need some time out or support in whatever other way. We invite everyone to be open and transparent, and I'm like that too with my people: if I'm not feeling great, I'll let them know. That way, we all rally together and just try to find the best solution to support that. For me, it has created just an amazing work environment, where it's really fun to work with friends every day on really amazing, challenging data science problems, and contribute to open source, and just make the world a little better, ultimately. Yeah. This is just something that I have been really excited about, and really excited that it's actually working. Right? It could have also just not worked, with no one doing anything, but we're relying on, I guess, the self-motivation of people, and building a system where they're actually excited and happy to do the right thing.
Jon Krohn: 01:10:43
Yeah. I think that there would be a lot of organizations out there where this kind of relying on people to be motivated and excited about solving problems bottom up wouldn't work, but when you've got top contributors to an exceptional open source library, I can see how that is the right approach to go with, and probably people really appreciate the extra latitude that that provides each of them. That sounds super cool. Speaking of your company culture, is there anything in particular that you look for in people that you hire?
Thomas Wiecki: 01:11:24
Yes. It's definitely this being self-motivated. That, I think, is the most critical thing, because we don't really have strong structures, which is the right thing for us just because things are so fluid. We're really researchers working on ill-defined research problems, which require really creative solutions. Yeah. It needs to be people who can operate in this very nebulous environment where things change and are very fluid, and you need to be self-driven to solve these types of problems. That is one.
Thomas Wiecki: 01:12:15
The other one, just on the technical side, is definitely open source contributions. So far, almost everyone we hired is a PyMC core developer, or at the very least has contributed to PyMC, because that is already a great demonstration of these types of things. Right? That is someone who's just excited to do this type of work. Also, for Bayesian modeling, it's now becoming financially more lucrative to invest in this, but before, not really. The people who did it were just like, "Oh, this seems like a cool thing. Let me just do that." Yeah. Then, in general, I would say the type of people, at least on the PyMC team, were just very collaborative and low ego. That also really helps with working together in this setting.
Jon Krohn: 01:13:24
Awesome. Yeah. Super fortunate that you have this pool of great candidates that have self-identified as really strong prospects for your consultancy.
Thomas Wiecki: 01:13:35
The best hiring [inaudible 01:13:36] you could imagine.
Jon Krohn: 01:13:37
Yeah. They're like in an ongoing... They don't even know it, but they're in a continuous job application process where they're submitting code for PR reviews, and you get a sense of who's really strong in this pool.
Thomas Wiecki: 01:13:54
Exactly.
Jon Krohn: 01:13:55
All right. Beyond PyMC, which we've obviously talked about a lot already in this episode, are there any other tools or approaches out there that you're really excited about these days?
Thomas Wiecki: 01:14:07
One really cool thing that I recently came across, and that has rocked the Python world, definitely the PyData world, is PyScript, which Peter Wang presented in the PyCon keynote that he gave. To me, it's still magic, but basically it allows you to run Python in the web browser, and there have been approaches like that before. Right? Well, just to take a step back, the first thing you do when you want to run a Python program is go and download Python through, for example, the Anaconda Distribution, right? That installs it locally on your computer, and then you can run it. But of course, the web has revolutionized the way we run programs. A lot of it is just apps now, and they don't require installation, because you have a web browser and that is the standardized compute environment.
Thomas Wiecki: 01:15:06
So far, JavaScript was the only language for web browsers, but now there is Python as well. Well, actually there's a whole tech stack behind it. Isn't that crazy? Pyodide is one of the tools, together with WebAssembly, but the end result is that instead of JavaScript, you can now just run Python in the web browser and do everything in Python that you would otherwise do with JavaScript. It's not an either-or, right? Actually, the two interact really nicely. The really crazy thing about it is that [inaudible 01:15:42], it's not just standard Python, it's actually also almost the full PyData stack. It's Pandas and [inaudible 01:15:53], and also PyMC, which I recently got to run; we'll link to that blog post in the show notes. Yeah. Now, for workshops that we're doing at Labs, people don't need to install anything. They can just go to the website and directly run PyMC there. You can build interactive web apps with PyMC just running in the browser. Yeah. For me, that's magic and it's amazing, and it just opens up Python to a much wider audience. In his talk, Peter cited some numbers that only about 1% of people are Python programmers, so this opens it up to the other 99% who can now run it. Yeah. That, for me, is the most exciting new development and something that I was very eager to play around with and eager to see where it goes. I think it's going to be revolutionary to the whole Python and PyData ecosystem.
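[Editor's note: as a rough sketch of what Thomas describes, a minimal PyScript page looks something like the following. The tag names and CDN URLs here reflect the early 2022 alpha release and may differ in current PyScript versions, so treat this as illustrative only.]

```html
<!DOCTYPE html>
<html>
  <head>
    <!-- Load the PyScript runtime (alpha-era URLs; current releases may differ) -->
    <link rel="stylesheet" href="https://pyscript.net/alpha/pyscript.css" />
    <script defer src="https://pyscript.net/alpha/pyscript.js"></script>
  </head>
  <body>
    <!-- Python code runs directly in the browser via Pyodide/WebAssembly -->
    <py-script>
print("Hello from Python running in the browser!")
    </py-script>
  </body>
</html>
```

Opening this file in a browser requires no local Python installation at all, which is exactly the workshop use case Thomas mentions.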
Jon Krohn: 01:16:54
Yeah. That is revolutionary. This is the first I've heard about it. Super exciting. Thank you for sharing that with me, Thomas. No doubt lots of our listeners will be excited by PyScript as well. PyScript aside, obviously we have focused primarily in this episode on PyMC and Bayesian statistics. For a listener who's interested in either getting started with Bayesian stats or maybe brushing up on their Bayesian stats, what's the best way they could do that?
Thomas Wiecki: 01:17:27
Yeah. There's the PyMC documentation page, which has tons of examples. There are many really cool books out there. There's one by Junpeng Lao, Ravin Kumar, and Osvaldo Martin that is really cool. But also, with Alex Andorra and Ravin Kumar, I have been working on an online course that will come out any week now, called Intuitive Bayes. It basically stems from what I alluded to earlier when I was talking about MCMC: there's the mathematical description, which is completely unhelpful, but the concept behind it is actually not that hard. You can explain it in an intuitive way. Really, that has been the motivation for my blog, which I've been writing for many years: making these concepts easier to understand for someone who does not have a PhD in math or statistics. This now is the culmination of years of thinking about how to explain Bayesian modeling in the most intuitive way possible.
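[Editor's note: the intuition Thomas refers to can be shown in a few lines of code. Below is a minimal random-walk Metropolis sampler written from scratch, purely for illustration; it is not PyMC's actual implementation, which uses far more sophisticated samplers such as NUTS. The idea is simply: propose a nearby value, and accept it more often when the target density is higher there.]

```python
import math
import random

def metropolis(log_prob, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: propose a nearby point and accept it
    with probability min(1, p(new)/p(old)); otherwise stay put."""
    rng = random.Random(seed)
    x = start
    lp = log_prob(x)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)       # wander to a nearby value
        lp_new = log_prob(proposal)
        # Accept uphill moves always, downhill moves only sometimes.
        if math.log(rng.random()) < lp_new - lp:
            x, lp = proposal, lp_new
        samples.append(x)
    return samples

# Target: a normal distribution with mean 3 and standard deviation 1
# (log-density up to a constant, which is all Metropolis needs).
log_prob = lambda x: -0.5 * (x - 3.0) ** 2
draws = metropolis(log_prob, start=0.0, n_samples=20000)
burned = draws[5000:]  # discard warm-up draws
print(sum(burned) / len(burned))  # sample mean, close to 3.0
```

Despite never computing a normalizing constant, the chain's samples approximate the target distribution, which is the core intuition behind MCMC.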
Thomas Wiecki: 01:18:40
We have basically put that into a video course, with presentations and notebooks and exercises, where we go through and explain it to someone who is not really familiar with it but maybe knows software engineering and Python, or a data scientist who knows machine learning but wants to learn Bayesian modeling, or someone who wants a deeper, intuitive understanding. Yeah. We basically start from very little, from nothing, and then build it up with a focus not on math, but on code and building intuitions. Yeah. That's Intuitive Bayes, and I'm really excited about it. It has been this vision that I have had for many years of finding a different approach to teaching this material, which is often taught in a very obscure way, but it really doesn't have to be. It's much simpler than people make it out to be. Yeah. I hope that people will find it valuable. And I would be remiss if I didn't mention the excellent podcast Learning Bayesian Statistics by Alex Andorra, where he has all kinds of incredible guests from the Bayesian world explaining all kinds of aspects of Bayesian modeling, so also check that out.
Jon Krohn: 01:20:01
I love that. I can't wait to study your Intuitive Bayes course. I am a little Bayesian grasshopper who is so excited about it. I have used PyMC. In my PhD, 10, 15 years ago, I was using WinBUGS on a Mac, so I had to figure out how to...
Thomas Wiecki: 01:20:21
Nice. OG.
Jon Krohn: 01:20:21
I had to dual boot Windows on my Mac in order to be able to... Or emulate it. I had an emulator for running Windows on my Mac, so that I could use WinBUGS.
Thomas Wiecki: 01:20:32
Wow.
Jon Krohn: 01:20:33
I've been dabbling in Bayesian stats for a long time, but I don't feel like I understand it very well. There's so much more that I could be doing, so I can't wait to check that out. This intuitive way of learning is similar to the approach that I took in my book, Deep Learning Illustrated, where we cover some of the central math, but the idea was to use visuals and illustrations to, as much as possible, give a high-level understanding of what's going on, so that you can get into the libraries, understand the application areas, start applying them, and get excited about the area. All the content that I've been creating since has been like, "Okay. Now, if you're excited about this field, let's dig into how it actually works under the hood in a lot of detail." This Intuitive Bayes pedagogical approach... Pedagogical? Yeah. I got that right. Yeah. It's got my blessing and I can't wait to check it out. Beyond that course, and beyond the Bayesian stats book that you recommended, which I will look up and include in the show notes, do you happen to have another book recommendation for us? I always ask guests for one.
Thomas Wiecki: 01:21:50
Sure. Technical or just in general?
Jon Krohn: 01:21:54
It could be anything.
Thomas Wiecki: 01:21:56
Okay. Well, I mentioned the Reinventing Organizations book. I like that.
Jon Krohn: 01:22:00
Oh, yeah. That's great.
Thomas Wiecki: 01:22:02
My favorite book is probably Hyperion by Dan Simmons, which is a science fiction book. That one, I would recommend. Then, another one, because I'm now diving deeper into the whole business aspect: Never Split The Difference is a book on negotiation styles. I really enjoyed that. Yeah. I've been applying it to great effect. Yeah, I would check that out as well.
Jon Krohn: 01:22:33
Well, yeah. We've had a few guests mention that recently or at least one, I don't know. It's come up very recently. I can't remember the exact details, but that definitely seems to be a book that people love, Never Split The Difference. Seems to be worth checking out. All right, Thomas. This has been an awesome episode. I've learned a ton, so hopefully the audience has as well. How can people be following you to stay up to date on the latest that you and PyMC, the software library, as well as PyMC Labs, the consultancy are up to?
Thomas Wiecki: 01:23:08
The best place probably is Twitter; I'm @twiecki on Twitter. We also have a PyMC Labs Twitter account. Most of my blogging these days I don't do on my blog, which is twiecki.io, but on pymc-labs.io. We have a blog there where me and other people from the Labs team publish the latest and greatest stuff. If you, for example, are interested in the PyScript one, you'll find it there, or running PyMC at large scale with Coiled computing and Dask, that's on there too. Also, the HelloFresh stuff that I mentioned, that is on there. Yeah. Those are the channels where I'm sometimes active.
Jon Krohn: 01:23:55
Super cool. We will be sure to include links to all of those in the show notes for listeners to check out. Thomas, thank you so much for being on the show. Maybe in a couple of years, we can check in on how you're doing and have another super informative episode.
Thomas Wiecki: 01:24:10
I loved it.
Jon Krohn: 01:24:17
Wow. What a blast learning from Thomas today. Brilliant, fun, and easygoing dude. I think he's going to have a lot of success with PyMC Labs. In today's episode, Dr. Wiecki filled us in on how Bayesian stats is the modular, LEGO approach to building data models, how the PyMC library allows us to automate the sampling of probability distributions and estimate Bayesian models, how media mix models are an example of how Bayesian models can be uniquely effective for tackling large-scale data problems, how hierarchical Bayesian models provide extra flexibility and power for representing real-world complexity, and how fluid hierarchies, pay disclosure, and a clear algorithm for bonus computation can enable employees to be both impactful and satisfied. We talked about the revolutionary PyScript library for running Python within a web browser's HTML, and he filled us in on the Intuitive Bayes course for learning Bayesian statistics intuitively yourself.
Jon Krohn: 01:25:16
As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Thomas's Twitter profile, as well as my own social media profiles at superdatascience.com/585. That's superdatascience.com/585. There were lots of book recommendations in this episode, weren't there? If you like book recommendations, check out a detailed spreadsheet of all of the book recs we've had in the nearly 600 episodes of this podcast by making your way to superdatascience.com/books. All right. Thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you. Thanks, of course, to Ivana Zibert, Mario Pombo, Serg Masis, Sylvia Ogweng and Kirill Eremenko on the SuperDataScience team for managing, editing, researching, summarizing, and producing another killer episode for us today. Keep on rocking it out there, folks, and I'm looking forward to enjoying another round of the SuperDataScience podcast with you very soon.