115 minutes
SDS 507: Bayesian Statistics
Subscribe on Website, Apple Podcasts, Spotify, Stitcher Radio or TuneIn
We dive into Bayesian statistics, from its history to the tools you can use to learn it, and round off the technical content by looking at the world of statistics PhDs and the life of a statistics PhD candidate.
About Rob Trangucci
Rob is a PhD candidate in Statistics at the University of Michigan. His research focuses on developing novel Bayesian statistical methodology for problems in epidemiology, and creating tools to understand how modeling assumptions impact inference. He has also done research in multilevel regression and poststratification. Before pursuing his doctorate he worked as a data scientist in fintech and publishing, and was a core developer on the Stan project. He got his Master’s in Quantitative Methods in the Social Sciences from Columbia University and his BA in Physics from Bucknell University.
Overview
Rob and I have known each other for years, and I've long wanted to have him as a guest because of his deep expertise in Bayesian statistics. Back in June, he was running a workshop on a Bayesian statistics library, which brought him back to the top of my mind. Rob attended Columbia for his master's, where a course called Missing Data sparked an interest that has run through his career. After leaving a post-graduation start-up, he reached out to a former instructor who happened to be working on a major open-source library, and he was brought on to the Stan project.
Stan is a project built in C++ that is both a statistical modeling language and a suite of inference algorithms. Most people interact with it through an interface, most commonly R or Python, though there are many ways to work with it from your software or language of choice. C++ was chosen for computational efficiency: it allows fine-grained control over exactly how much memory any given variable uses. Many inference algorithms require gradient information, and Stan computes gradients incredibly quickly thanks to code specialized for statistical modeling.
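To give a flavor of what that interaction looks like, here is a minimal, illustrative sketch of fitting a toy coin-flip model from Python via the CmdStanPy interface. The episode discusses the R and Python routes only in general terms; the specific interface, model, and data below are our own illustration.

```python
# A minimal sketch (not from the episode) of calling Stan from Python via
# CmdStanPy. CmdStan must be installed separately; the model and data here
# are illustrative only.
from pathlib import Path
from cmdstanpy import CmdStanModel

# A tiny Stan program: Bayesian inference for a coin-flip probability theta.
stan_code = """
data {
  int<lower=0> N;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  real<lower=0, upper=1> theta;
}
model {
  theta ~ beta(1, 1);      // prior
  y ~ bernoulli(theta);    // likelihood
}
"""
Path("bernoulli.stan").write_text(stan_code)

model = CmdStanModel(stan_file="bernoulli.stan")   # transpiled to C++ and compiled
fit = model.sample(data={"N": 10, "y": [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]},
                   chains=4, iter_sampling=1000)   # sampling with NUTS
print(fit.summary())                               # posterior mean, sd, R-hat, etc.
```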
From here we dove into what Bayesian statistics even is. One way to frame it is by contrast with frequentism, which attacks a problem by asking: if you were to vary the data under some prescribed distribution, how would your estimates change? Those are hypothetical data sets, while Bayesian inference works with the data set you have in hand: you make inferences based on the concrete data in front of you. One apparent downside is that you can't get probability statements for free; you have to assume something about the parameter space, which means quantifying your beliefs about the parameter as a prior distribution. You then update those beliefs with the data to get a posterior distribution. Many people say "I want the data to speak for themselves," which is a nice sentiment but often not possible, because even when you don't think you have information about the parameter space, you usually do. On a historical note, Bayesian statistics is older than frequentism, going back to Thomas Bayes in the 18th century. Ultimately, Rob thinks the two approaches answer two separate questions: one about hypothetical distributions of data, the other about inference from the data you actually observed. He also sees decision-making as one of Bayesian inference's great strengths. From there we got into a deep technical conversation around multimodality in deep learning, gradients, and other topics spanning statistics, data science, and probability distributions.
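In symbols, the updating described above is Bayes' rule: the posterior over a parameter theta, given data y, is proportional to the likelihood times the prior.

```latex
p(\theta \mid y) \;=\; \frac{p(y \mid \theta)\, p(\theta)}{p(y)} \;\propto\; p(y \mid \theta)\, p(\theta)
```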
We dove into the Stan package, named after Stanislaw Ulam, who worked at Los Alamos and invented the Monte Carlo method. At the core of Stan is a Hamiltonian Monte Carlo sampler, an approach that originated in molecular dynamics, where systems follow Hamiltonian motion. Radford Neal at the University of Toronto brought Hamiltonian methods to statistics, where they allow for much better exploration of posterior distributions. In addition to Stan, Rob calls out other Bayesian stats libraries such as brms and rstanarm.
As far as applications go, some of Rob's PhD research projects have focused specifically on COVID. Rob is part of a group that applies Bayesian statistics to epidemiology, and early in the pandemic they were working with the state of Michigan to help understand trends in caseloads and the spatial distribution of cases across the state. He was involved in processing and cleaning the data as it came in, to help put together a website showing an up-to-date map of disease rates across Michigan. They noticed quickly that race and ethnicity were often missing in the COVID case data, which stymied understanding of racial and ethnic disparities in COVID cases and deaths. Missing data has been a professional focus of Rob's for many years, so the group developed a method to overcome this challenge: a model that takes in population data from the census to jointly learn the rate of disease by race and the proportion of missing race data by race.
From there we dove into the day-to-day life and work of a PhD candidate in stats. For the first few years, you take coursework with your cohort, followed by larger projects and research in specific labs in your area of interest. At that point, the day-to-day becomes writing code, simulating data, and fitting models. Rob's path has alternated between academia and industry: undergrad, three years of consulting, a master's, several years of work, and now a return to academia for his PhD. This apparent rollercoaster was, more or less, the result of following his interests as they evolved.
In this episode you will learn:
- Getting Rob on the show [8:12]
- Stan [9:34]
- Gradients [18:15]
- What is Bayesian statistics? [23:05]
- Multimodality in deep learning [45:20]
- Stan package [53:46]
- Applications of Bayesian stats [1:09:47]
- The day-to-day of a PhD in stats [1:21:56]
- What does the future hold? [1:42:37]
Items mentioned in this podcast:
- Stan Project
- DataScienceGO Connect
- Jon’s learning resources
- A Conceptual Introduction to Hamiltonian Monte Carlo
- brms
- rstanarm
- Michigan COVID-19 Mapping
- Statistical Rethinking by Richard McElreath
- Bayesian Data Analysis, 3rd Edition by Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin
- Stan User’s Guide
- Dead Wake by Erik Larson
- The Splendid and the Vile by Erik Larson
- Mathematical Foundations of Machine Learning
Follow Rob:
Follow Jon:
Podcast Transcript
Jon Krohn: 00:00:00
This is episode number 507, with Rob Trangucci, PhD candidate in statistics at the University of Michigan.
Jon Krohn: 00:00:12
Welcome to the SuperDataScience podcast. My name is Jon Krohn, chief data scientist, and best-selling author on deep learning. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. Thanks for being here today, and now let's make the complex simple.
Jon Krohn: 00:00:48
Welcome back to the SuperDataScience podcast. Today's episode with Rob Trangucci is absolutely epic. Over the course of his career, Rob has alternated between spells in education and industry. After attaining his MA in quantitative methods from Columbia University, he worked as a data scientist at a financial technology startup, and then as a statistician at Columbia, specifically in the lab of Andrew Gelman, who is himself perhaps the best-known statistician of our time. During that period at Columbia, Rob became a core developer on the open-source Stan project, a leading software library for drawing inferences from data using Bayesian statistics. Fascinated by deep, challenging and important problems in Bayesian statistical research, Rob began his PhD in stats at the prestigious University of Michigan in 2017, and has been working on that since. If you haven't heard of Bayesian statistics before, today's episode will introduce you to what it is and why, in many common situations, it's uniquely powerful relative to other approaches to modeling data like machine learning and frequentist statistics.
Jon Krohn: 00:01:59
If you have heard of Bayesian stats, even if you're an expert in it, today's episode is a rich resource on the centuries-old history of the approach, its strengths, its real-world applications, including to COVID epidemiology research, Rob's particular focus at the moment, the best software libraries for applying Bayesian statistics yourself, and the best resources out there for learning about Bayesian stats yourself. Near the end of the episode, Rob rounds the technical content off by detailing for us what life is like day to day for a PhD student in statistics at a top institution, whether you might want to consider doing a stats PhD yourself, and the options that are available to you after you complete one. Today's episode is definitely on the technical side. We did our best to explain everything in an engaging and accessible way, but for long stretches, Rob gets deep into the theory and practice of Bayesian stats, which will definitely be easiest to follow along with if you have a working understanding of data distributions and data modeling. You ready for this epic and content-rich episode? Let's do it.
Jon Krohn: 00:03:10
Rob, welcome to the program. I'm so excited to catch up with you on air. Where in the world are you calling in from?
Rob Trangucci: 00:03:17
It's great to be here, Jon. I am in La Crosse, Wisconsin, which is sort of the westernmost border of Wisconsin, right against the Mississippi River, which separates Minnesota from Wisconsin. Maybe like two hours southeast of Minneapolis.
Jon Krohn: 00:03:42
Nice.
Rob Trangucci: 00:03:42
And like, yeah.
Jon Krohn: 00:03:43
And we're recording in summertime. So I imagine it's great there, it sounds like a kind of place that would be a really cold winter.
Rob Trangucci: 00:03:50
Yes. It is. Typically, has pretty bad winters. We luckily avoided the winter. When we came to La Crosse, we came sort of early March. But yeah, the summers here are really beautiful. The only wrinkle was, and you probably know this, but there were wildfires in Canada, like north of Minnesota. And we were getting smoke from Canada through Minnesota and Wisconsin.
Jon Krohn: 00:04:28
Actually I didn't know about this.
Rob Trangucci: 00:04:31
And it was sort of like... It felt very dystopian because I was using my pandemic mask to filter out the smoke from the wildfires. It was just like, this is the world that we live in, but it was an interesting time.
Jon Krohn: 00:04:48
I know what it's like. I've gone through that. I lived in Singapore for 12 months. And at the time that I was there, there were fires, which were started by farmers of peat in Indonesia. So Singapore borders Indonesia, and they were burning peat to make room for land, for growing things and cattle and whatever. But peat is an extremely dense carbon store. And so not only is it one of the worst possible things you could be burning for the environment, it creates a lot of smoke. And so my first experience wearing a mask all the time was long before the COVID pandemic, living in Singapore in the smoke. So I know what that's like, but has that cleared up now? That's better?
Rob Trangucci: 00:05:30
Yes. Yes. When it's not raining, and it happens to be raining today. Blue skies, very sunny. It's a nice area. Being right on the Mississippi is sort of... It's very nice.
Jon Krohn: 00:05:45
Yeah. I don't know if I've ever been to the Mississippi anywhere in the US.
Rob Trangucci: 00:05:48
Have you ever been in New Orleans?
Jon Krohn: 00:05:50
No.
Rob Trangucci: 00:05:52
Okay.
Jon Krohn: 00:05:52
But I heard the jazz is good. The jazz and the Mississippi. So we've known each other for a long time, I think since about 2014, which is seven years ago. We met through a colleague of mine, Michael Silver. So at that time I was working at a company called Annalect, which is the data science subsidiary of Omnicom, a giant media conglomerate in New York. And Michael was a product manager there. Based on a LinkedIn check that we did just before recording, it looks like Michael is now the lead product manager actually, and you confirmed this verbally because you stayed in touch with him, but he's the lead product manager at Product Hunt, which is a super cool company. If people have startups, you're definitely aware of Product Hunt, because getting your consumer-facing startup product onto the front page of Product Hunt is a huge boon to the success of your tech startup. So very cool job from Michael there. I don't know if you want to tell us about how you know him or cool things you know about Product Hunt.
Rob Trangucci: 00:06:58
Yeah. So Michael and I met in second grade, at summer camp. We were both going to Riverbend summer camp in New Jersey. Then we both, we went to elementary school together. And so we've been friends for a very long time. I've sort of very luckily been able to stay in touch with my high school friends, sort of throughout college and afterwards. And I think at the point where we met, we might have all still been living together in the financial district.
Jon Krohn: 00:07:46
Yeah. I think that could be right. I think he was introduced to me as your roommate.
Rob Trangucci: 00:07:51
Yes.
Jon Krohn: 00:07:53
Yeah. How times have changed, I guess he's in California now.
Rob Trangucci: 00:07:57
No, he's in Brooklyn.
Jon Krohn: 00:08:00
Oh, he is. Okay. And well, he has gone far away. And so we'll talk about that. So you came to mind for me, actually, I'd had you on my mind as an amazing guest for the SuperDataScience show when I took over from Kirill on January 1st. So I created this kind of list of who do I know that would be a great guest on the show. And you came to mind because you have deep expertise in a fascinating area of data science, Bayesian statistics. So we're going to talk about what Bayesian stats is, how it compares with other statistical approaches, what its value is in the world and how it allows us to solve problems in new ways. But before we get to that, the thing that most immediately reminded me, where I was like, oh, I should reach out to Rob and see if he wants to be a guest on the show, is that you were doing a workshop on a Bayesian statistics software library called Stan back in June, I think it was.
Rob Trangucci: 00:09:01
That's right.
Jon Krohn: 00:09:02
And so, your name popped up in this email from Jared Lander's company. So folks who listened to episode 501, we had Jared Lander on the show, and we mentioned how he runs a company called Lander Analytics. He also runs the Open Statistical Programming Meetup in New York. And I subscribe to both of those newsletters, and both of them mentioned the Stan workshop that was coming up and mentioned you by name, Rob, as one of the people who were running this workshop on Stan. And so you've been a core developer on the Stan project since 2014. How did you get involved with being a core developer on a library like this? So at the time you were a Columbia University statistician, you were working in Andrew Gelman's lab. So if people aren't aware of Andrew Gelman, he's perhaps the world's best-known statistician, and he's also talked about in a fair bit of detail in episode 501 with Jared Lander. So yeah, how does someone get involved in being a developer on a big open-source project like this?
Rob Trangucci: 00:10:09
And I think it makes most sense to start with my grandfather's grandfather.
Jon Krohn: 00:10:16
Big open source developer. I remember.
Rob Trangucci: 00:10:20
I was at Columbia for my master's and I took a course from Ben Goodrich. He was an instructor in the master's program that I was in, and it was my favorite course that I took in the program. It was called Missing Data, which has sort of been a through line in my career, and it's connected to research that I'm doing right now. So I left Columbia, I was working at a startup, I left that startup and I was looking for something to do. And I reached out to Ben. At that point, Stan was sort of in its infancy; it was sort of the brainchild of Andrew Gelman, Bob Carpenter, Matt Hoffman, Daniel Lee, Michael Betancourt, all of these researchers at Columbia. So I reached out to Ben, I was looking for something to do. He had talked about this software package called Stan when I took his missing data course. And I said, "Hey, look, I've got time on my hands. I'd love to contribute." And he said, "I think you could come in and work for Andrew and work full time on the Stan project. Is that something you'd be interested in?" So I went to Columbia, I met with Andrew, I met with Daniel Lee, I met with Bob Carpenter. And I loved the team. I thought it was just a nice group of people. They were doing interesting stuff. And so I signed up.
Jon Krohn: 00:12:22
You may already have heard of DataScienceGO, the conference run in California by SuperDataScience, and you may also have heard of DataScienceGO Virtual, the online conference we run several times per year. In order to help the SuperDataScience community stay connected throughout the year, from wherever you happen to be on this wacky giant rock called planet earth, we've now started running these virtual events every single month. You can find them at datasciencego.com/connect. They're absolutely free. You can sign up at any time, and then once a month we run an event where you will get to hear from a speaker, engage in a panel discussion, or an industry expert Q&A session. And critically, there are also speed networking sessions where you can meet like-minded data scientists from around the globe. This is a great way to stay up to date with industry trends, hear the latest from amazing speakers, meet peers, exchange details, and stay in touch with the community. So once again, these events run monthly. You can sign up at datasciencego.com/connect. I'd love to connect with you there.
Jon Krohn: 00:13:35
So maybe not the most common way to get involved with an open source project. Meet people in person... I mean, meeting people in person is probably one of the best ways to get involved with these kinds of things. So, for example, in the New York area, the Open Statistical Programming Meetup that I already mentioned is a great way to meet people. I mean, certainly, if you're in the New York area, once we're post-pandemic and Jared is having meetups in person again, that is a great example of a community to come to and say, "Hey, I have this level of experience in programming or this kind of statistics," or whatever your expertise is, and to kind of stand up and just say, "I would love to be contributing to an open source project." And there will be tons of people in that room who are working on them, and people are always looking for more contributors for those kinds of projects. So I think that meet-in-person route is totally viable. I haven't contributed to someone else's open-source project in a meaningful way, so my guess is the other way to get involved is to be reviewing Git commits and submitting your own pull requests on GitHub. I think often, if you look at a major library like TensorFlow or PyTorch or Python or whatever, all of these are open source projects, there's often actually specific details on the README page of the GitHub repo as to how to get involved as a contributor.
Rob Trangucci: 00:15:07
Yes. Yeah. I mean, I would say when I first joined, I think the community was smaller. It's since grown a lot, both through the Stan forums, which are sort of one way to get information about Stan, where a lot of the developers are active and they respond to posts, but also on GitHub. And I think even on GitHub, on the issues page, there are tags indicating what would be a good first project. And there's really no... If you're interested in it and you can write C++ code, or you want to learn how to write C++ code, that's a good way to get involved.
Jon Krohn: 00:16:01
Nice. And the reason why you mentioned C++ is because that's what underlies Stan specifically...
Rob Trangucci: 00:16:05
That is what underlies Stan.
Jon Krohn: 00:16:08
So if you want to contribute to an open source project in general, you don't necessarily need those announcements. But I always think of Stan as something that we call from R.
Rob Trangucci: 00:16:19
Yes. Yeah. I started to talk about what Stan was a little bit earlier, but Stan is sort of both a statistical modeling language, like a domain-specific language for statistical modeling, and a suite of inference algorithms focused mainly on Bayesian inference, though there are frequentist inference algorithms included. Stan itself is a collection of C++ libraries, plus an OCaml transpiler for taking Stan code and compiling it down to C++ code. The way that most people interact with Stan, and how I interact with Stan on a daily basis, is through one of the interfaces. So if you want to be able to write your programs, compile them and run them, and get whatever results out that you want, you're going to use an R interface, either CmdStanR or RStan, or the Python interfaces. I mean, there are really many different ways that you can interact with Stan, via your sort of software of choice, though the most common routes are via R or via Python.
Jon Krohn: 00:17:52
Nice. And so I guess the reason why, kind of at its core, in the guts, the reason why you would develop in C++ is because it's the most computationally efficient option you can use. You can be so thoughtful about exactly how much memory a given variable is using up, for example.
Rob Trangucci: 00:18:13
Yes. Yes. It is. Stan manages its own memory. Without going into too much detail, a lot of inference algorithms require gradient information, and Stan computes gradients very quickly and has a bunch of specialized code geared towards statistical modeling, much like TensorFlow does, that makes taking gradients of probability density functions fast. And so that algorithm is implemented in C++.
Jon Krohn: 00:18:59
Yeah. I think this is something that we should open up and talk about it a little bit here as to what gradients are. We can just spend a couple of minutes talking about that. So when we mentioned libraries like TensorFlow or PyTorch, a lot of people think of those libraries as a deep learning library, because that's kind of what they became famous for. They became famous because they made it easy to build the layers of an artificial neural network that makes it the deep learning model, but they also include functions that make it easy to differentiate that model and train it. So this idea of differentiating comes from differential calculus, which is one of the main branches of calculus. And it's in calculating these differentials of a model that we get this gradient, which effectively just means the slope of a relationship between how wrong your model is at predicting the outcome that you'd like it to predict.
Jon Krohn: 00:20:04
And the current model weights that your model has. So it's a slope, it's this relationship. And so you can basically at any given point in time with your model, you can say, "Okay, based on the training data that I have, and the model weights that I have, and how wrong my model is, how can I adjust any one of my model weights?" Well, you can use the gradient. You can say, "Okay, well, there's this slope. So if I increase this particular model weight out of the thousand or the million or the dozen model weights in my model, if I increase it, then my model will be more wrong." So that's not the direction we want to go. We want to decrease this particular parameter of my model, and therefore my model will be less wrong at this particular point, with these particular training data. And so by iterating that over and over and over again, it allows models in a lot of different paradigms, including machine learning and Bayesian statistics, which we will talk about in detail shortly, to learn. And if all of that sounded interesting, but you don't know much about it, I'm going to plug something quickly for myself here, Rob. My apologies, but...
Rob Trangucci: 00:21:12
That's okay, you don't have to apologize. It is your podcast.
Jon Krohn: 00:21:16
Well, it's your episode. And so, if you're interested in learning about gradients and the calculus that allows you to compute these gradients, and how these gradients allow machine learning models to learn, and also allow Bayesian statistical models to learn, I have a calculus for machine learning course that I'm rolling out on YouTube: every Wednesday, I publish a new video to that course. It's on YouTube, so it won't be hard to find, and we'll have it in the show notes. Anyway, so a big digression on gradients there, but I don't think I've talked about gradients in detail on the show before, and they are at the core of so many models, so many models of learning.
Rob Trangucci: 00:22:02
Yeah. Yeah. Yeah. I think it's a great description. And something that I think people don't often think about, but should, because ultimately that is what makes learning happen for most algorithms. And so it's good to understand what's going on under the hood.
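To make the gradient idea above concrete for readers, here is a minimal, illustrative gradient-descent loop on a one-weight model; the data, model, and learning rate are made up purely for the example.

```python
import numpy as np

# Minimal illustration of learning via gradients (made-up data).
# Model: predict y with a single weight w, i.e. y_hat = w * x.
# Loss: mean squared error; its gradient w.r.t. w is the slope described above.
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.5, size=100)   # "true" weight is 3.0

w = rng.normal()            # random initial weight
learning_rate = 0.1
for step in range(200):
    y_hat = w * x
    grad = np.mean(2 * (y_hat - y) * x)   # d(loss)/dw: which way makes us more wrong
    w -= learning_rate * grad             # move downhill: a little less wrong each step
print(round(w, 3))          # ends up close to 3.0
```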
Jon Krohn: 00:22:23
Yeah. I think there's a lot of people out there. I mean, that's why I've been creating this content on the mathematical foundations of machine learning, because I think there's a lot of people out there who get used to training a machine learning model using scikit-learn or using a tool like Stan. And in the beginning, that's kind of satisfying. You're like, "Oh, look at these amazing things I can do." Then you're like, "Why is this happening?" And it's kind of disconcerting. You're like, well, if there's a problem and the model doesn't train, is it because of something I don't know about how this works? And there's a good chance the answer is yes.
Rob Trangucci: 00:22:59
Yes.
Jon Krohn: 00:23:03
So, all right. So I think it's time. There's going to be a lot of exciting parts in this episode, but I think one of the most interesting, Rob, is going to be explaining to the audience. And I can't wait to hear your description of what Bayesian statistics is. And you actually, you mentioned a term earlier, which now I feel like I'm probably mispronouncing. I always say frequentist statistics, but you said frequentist statistics. And I've heard that before too. What are your thoughts on that other branch of stats? What should we call it?
Rob Trangucci: 00:23:31
I mean, look, Bayesian statistics and frequentism, I think, answer different questions about the problem that you're facing. And I tend to think that most problems are best attacked with Bayesian inference, but there are good examples of problems that might be better approached through a frequentist lens. So frequentism typically asks: if I were to vary the data under some prescribed distribution, how would my estimates change? And so you're thinking about hypothetical data sets, not necessarily the data set that you have in hand. If you are going to do a t-test or something, or compute a p-value, you have some null model that you're measuring your statistic against. You're really measuring the distribution of the statistic under the null model against the statistic that you do observe. And you can essentially compute that p-value, which is just, under the null model, what's the probability of observing a statistic that's at least as large as what I have observed? And it's sort of a way to reject this hypothesis that your model is the null model. And that's often not what people want. People want: okay, what is the probability of my parameter being in a certain interval, based on the data I have?
Rob Trangucci: 00:25:16
And that's, I think, a very natural question to ask, and that's what Bayesian inference allows you to do. And it comes with a cost. There are, I think, two primary costs, though I would argue one is a benefit, but it's often framed as a cost. You can't get these probability statements for free. You have to assume something about your parameter space before you observe your data. You have to essentially quantify your beliefs about the parameter using a probability distribution. And then you observe your data, and then given your data and your prior beliefs about that parameter, you get a posterior distribution that sort of updates your beliefs about the parameter space. And so sometimes people say, "Okay, well, I want to let the data speak for themselves." And I think in theory, that's a great sentiment, but often we have information about the parameter space, even if we don't think we do.
Rob Trangucci: 00:26:31
A good example is in a regression problem. If you have, let's say, centered and scaled all your predictors, we know pretty well that you're not going to have a regression coefficient that is on the order of 10 to the 5. Right? And so you can encode that sort of information in your prior. The second downside of Bayesian inference is often computation, because of the class of models for which, everything I described, you have a prior distribution, you've observed some data, you get a sort of updated set of beliefs after you observed the data. That sort of computation, which is really the calculus of probability, the class of models for which we can do that in a closed form is very small. That's something called conjugate priors. And so most of the time, if you have sort of an arbitrary prior, an arbitrary likelihood, there's no closed form solution for the posterior. And so ideally, you'd like to sample: a way to compute quantities of interest from your posterior distribution is to sample from your posterior and then compute expectations. And I've been talking for a little bit, so I...
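As a concrete illustration of the conjugate case Rob mentions, here is the best-known closed-form example: a Beta prior on a success probability updated by binomial data. The prior settings and data below are made up for the sketch; for non-conjugate models you would instead sample, for example with Stan.

```python
from scipy import stats

# Illustrative conjugate example: Beta prior + binomial likelihood.
# Prior beliefs about a success probability theta: Beta(2, 2), centered at 0.5.
prior_a, prior_b = 2.0, 2.0

# Observed (made-up) data: 7 successes out of 10 trials.
successes, trials = 7, 10

# Conjugacy means the posterior is again a Beta, with updated parameters.
post_a = prior_a + successes
post_b = prior_b + (trials - successes)
posterior = stats.beta(post_a, post_b)

print(posterior.mean())               # posterior mean of theta
print(posterior.interval(0.95))       # central 95% posterior interval
```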
Jon Krohn: 00:28:22
No, it's all been perfect. So, yeah. So you talked about these two supposed limitations of Bayesian statistics versus... Actually my original question was a much simpler question. So you answered the question that I was going to ask right after, which is, what's the difference between Bayesian stats and other approaches? But specifically, the main other approach that I was going to have us compare against was the other big branch of stats, which I pronounce frequentist and you pronounce frequentist. I mean, so let's answer that question, and then I...
Rob Trangucci: 00:29:03
I would, I mean, I have heard people say frequentism, but I don't know, maybe in Canada people have a different accent? I don't know.
Jon Krohn: 00:29:14
I don't know. I don't know how many Bayesians I've ever met in Canada, or-
Rob Trangucci: 00:29:17
I think there are a lot of [inaudible 00:29:19].
Jon Krohn: 00:29:19
I'm sure there are, but I did most of my education abroad. So maybe it's a British name, maybe frequentism is a British sounding thing that I picked up. But anyway,
Rob Trangucci: 00:29:30
That could be.
Jon Krohn: 00:29:31
I think people always know what you're talking about. Anyway, so it doesn't really matter. So then you answered the question I was going to ask after, which is about, what is Bayesian statistics and maybe compare it with frequentism. So there's a historical thing, which I think is interesting that we can bring into the picture here, which is that Bayesian statistics is a lot older than the frequentist approach. So, I can't remember the exact dates, but it's something like the late 18th century or maybe even the early 18th century, he was like a religious leader? I mean, I guess a lot of educated people at that time, they were like the priest locally, but his name was Bayes. Thomas Bayes, I think.
Rob Trangucci: 00:30:16
Mm-hmm (affirmative). Mm-hmm (affirmative).
Jon Krohn: 00:30:17
And so Thomas Bayes came up with a formula that was the earliest known equation of what you described in more detail now, and that has been generalized by people since, which is this idea of having some prior belief and then using data to update that prior belief, which is going to be a number in Bayesian statistics, to some posterior value. And then you can use that posterior value to make decisions confidently, to know the world a little bit better. Now, interestingly, there are no photos or paintings of Thomas Bayes, obviously there's no photos. Nobody knows what he looked like, because I've tried to put him in a deck, and there was no way for me to do that. So I picked Laplace, who had done a lot of work. Right.
Rob Trangucci: 00:31:15
Yeah. That's what I was going to say. I don't know the historical background super well, but I do know that Laplace made a lot of contributions, sort of took Bayesianism from this spark of an idea and made it more workable.
Jon Krohn: 00:31:41
Exactly. He generalized it from just one equation to a generalized equation, a body of statistical ideas, a way of making confident, quantitative inferences about things we observe in the world. And so, yeah, Pierre-Simon Laplace, he was a Frenchman and he was a polymath, as a lot of these kinds of people were at that time, making big contributions. So he kind of popularized this idea, and then for about a century it was the leading way. So this Bayesian statistical approach that Laplace popularized was the leading way of trying to make thoughtful, quantitative inferences about things we observed in the world with data. But then in the early 20th century, it fell out of fashion, and I think in large part because of these things that you mentioned. So these two drawbacks, a century ago in the early 20th century, were a big problem. So, first the compute thing: you had to do everything by hand. So the large compute associated with Bayesian statistics was a pain in the butt, I guess, then, because you're like, "Ah, more pencils. I keep running out of all the pencils." I don't even know if they used pencils. So then, "We're running out of feathers."
Rob Trangucci: 00:33:05
That's it. That's it. Feathers and ink.
Jon Krohn: 00:33:08
But then the other thing was this idea of a prior made people feel kind of icky. So like, R. A. Fisher, a century ago, is like, "No, we can't have this in here. It's unobjective, you can't start off objectively observing data if we're having these prior beliefs in advance." So I don't know, now I've been talking for a while. I feel like I should give you a chance.
Rob Trangucci: 00:33:34
Yeah, no, I mean, it's, there's an interesting dichotomy between the two disciplines but I do think that they answer two different questions. So one is this, what is the probability that my parameter is in this interval? The other is based on the sort of hypothetical distribution of the data, what are the family of intervals that,
Jon Krohn: 00:34:08
Yeah.
Rob Trangucci: 00:34:08
I could observe? Ultimately, I think a benefit of Bayesian inference is potentially decision-making. And I think that is just a general area that was very in fashion in the 1950s and sixties. And then it sort of got picked up by the economists and, for some reason, certain statisticians were very into decision theory. But I think that's an area where statistics could continue to make important contributions to the world, and I think there's a real opportunity for that. And it's not necessarily new ideas, it's just reteaching, rehashing things that are [crosstalk 00:35:10]. Yeah.
Jon Krohn: 00:35:12
And so, because of this, because of people like R. A. Fisher, hugely influential statisticians a century ago, poo-pooing this idea of prior information, and this [inaudible 00:35:24] compute, if you learned statistics in university in the 20th century, certainly the latter half of the 20th century, or almost certainly even in the 21st century, everybody learns... If you've done a psychology degree or a biology degree or a chemistry degree or physics degree, an engineering degree, when you take your stats 101, probably all the way through to your third year, maybe fourth year stats classes as an undergrad, you study exclusively this frequentist approach. So going back to your two big issues, the latter one, the compute one: every passing year, that's half of the issue it was the year before. So this exponential decrease in compute costs means that we don't have to worry about compute nearly as much. You still do, but it's less and less of a problem all the time. And then the issue of the prior, I guess this is more of a philosophical debate than a technological debate, but with the prior in Bayesian statistics, to say, "Well, there's that assumption that we make, and it's a problem."
Jon Krohn: 00:36:47
There are so many assumptions in frequentist statistics anyway. So what distribution you're using, the assumptions you're making about this theoretically unknown and unobservable population distribution that you can never measure, that you're just assuming exists, so you have all of these assumptions baked into frequentist stats anyway. And even the arbitrariness of a 0.05 threshold, especially when we have lots of observations that we're making, frequentist statistics has tons of assumptions anyway. So, okay, we're making some anyway, why not a prior? And one of the cool things you can do with a prior, to kind of make it uninformative anyway, is you can sample from a uniform distribution over a broad and reasonable range, or maybe you even have other ideas, but there are ways that you can make these priors uninformative. And please do let me know better ways than what I just described for doing it.
Rob Trangucci: 00:37:37
Yeah, no, I would say that an uninformative prior in one parameter space is an informative prior in a different parameter space.
Jon Krohn: 00:37:49
Right.
Rob Trangucci: 00:37:49
And so I would argue that it's rare that you can really have a totally uninformative prior. And I would also maybe argue that you might not want to, just because you typically do know a reasonable range for things, but you don't want to be too doctrinaire and say, "I don't think that this parameter can be any larger than 10 or any smaller than -10," and essentially put boundaries on your parameter, unless there are extremely good reasons to do so,
Jon Krohn: 00:38:31
Right.
Rob Trangucci: 00:38:32
Because if you're wrong and the parameter is 11, you won't ever learn it.
Jon Krohn: 00:38:37
Right.
Rob Trangucci: 00:38:38
And so you want to have like a soft constraint. Your prior should put, I don't know, 99% of that probability mass between two points, but there are tails where you can be wrong and still learn something.
Jon Krohn: 00:38:55
That makes perfect sense. Now, something that I've been dying to bring up, and I'd love to hear if you feel the same way about this, because you know a lot more nuance about this than me, but this idea of starting with maybe an arbitrary value, or maybe a meaningful value... In machine learning, we start with typically a random value, sampled from a distribution. So when we create a neural network, a deep learning network, let's say we have a million parameters in our neural network that we want to be able to learn. Well, we initialize those million parameters with random values, like a prior, though we don't call them a prior, but you start with this randomly initialized value. And to get those randomly initialized values, the typical thing in machine learning is to randomly sample from a distribution. So maybe if it's a deep learning network, then we'll sample from a He distribution, or a Xavier distribution, or something like this, a Glorot distribution.
Jon Krohn: 00:40:01
So you have these particular kinds of distributions that you can sample from that will give you reasonable starting values for the initial values in your machine learning model, which in the Bayesian world we could call a prior. And then you use training data to update those parameter values, and when your model has stopped learning, when it no longer is learning on validation data that are outside of the training dataset, we say, "Okay, my machine learning model, my deep learning model," say, "has finished learning," and then you have these final model weights that are like a posterior in Bayesian stats, and you can use those for production inference. So did I just say a whole bunch of things that make you feel really uncomfortable, or is there truly this kind of parallel between Bayesian statistics and machine learning?
Rob Trangucci: 00:40:59
I'm going to have to disagree a little bit, but yeah. I mean, this is not my area, I don't know very much about neural nets or deep learning. What I do know is that the sort of objective function can be super multimodal, and often you're finding a good local minimum of this super high-dimensional objective.
Jon Krohn: 00:41:29
Rob, Rob, Rob, I only ever find the global minimum. [crosstalk 00:41:33] I've never gotten stuck in a local minimum. Carry on.
Rob Trangucci: 00:41:36
So, yes. So I feel like, the starting values, you are sort of localizing yourself to one of these modes. And so in that sense, it is a bit like a prior, but I would argue that it's more like a regularizing problem and less like a posterior. Because you get a point value.
Jon Krohn: 00:42:02
Right.
Rob Trangucci: 00:42:05
Something like the key to Bayesian inference, and a big difference compared to frequentism: if you pick a point, you're going to get a point in the parameter space that, if you're doing maximum likelihood, maximizes the likelihood function of the observed data. And with Bayesian inference, you need the distribution. You want all the parameter points that are consistent with the data you've observed and your prior, sort of weighted correctly via Bayes' rule. And this is something that Michael Betancourt talks about a lot, a Stan developer and someone who thinks a lot about the differential geometry of posteriors and how... The algorithm that Stan uses is called Hamiltonian Monte Carlo, and it's a variant of that, and we can talk about that a little bit later, but he's done a lot of thinking about the differential geometry of this algorithm. And so something that he often says is that you need to quantify a distribution, and that's different than just picking a point. And so I think in your example, you're going to get a point out, but the Bayesian would want the full distribution. That is very hard with a neural net because it's multimodal. And so you either need... And this is where I'm sort of out of my depth, I don't know very much about it. I know that there is Bayesian deep learning and it sounds very hard to me because,
Jon Krohn: 00:43:52
Yeah, yeah, you're totally right. And that is a huge distinction. You're completely right that in Bayesian stats, and that is, I've made a huge note here to make sure I emphasize this in my outro to the episode, that there is this big difference between using what you call the point value, a single number, as both the initialized weight in the machine learning model, as well as the weight that we have come out of it, that we've learned, and that we can use in a production model. With Bayesian stats, the prior is a distribution, and after training, you come out with a distribution. So it's this whole other dimension of richness that doesn't happen in machine learning, notwithstanding, as you mentioned, that there are people who work on Bayesian machine learning, like Bayesian deep learning, as a specific subfield of that.
Jon Krohn: 00:44:47
So that is, yeah, definitely a big difference. So maybe me talking about it in that way, the parallels, hopefully if you're familiar with machine learning, maybe it kind of gives you a picture of how a Bayesian model trains. Well, not really how it trains, but kind of how it proceeds. And I'm actually really glad to have had this conversation, because I think I have been making too many parallels between Bayesian stats and machine learning, but I will now be more careful about that. A word that you used a couple of times, though, that I'd love to understand better, and that our audience might love to understand better as well: you said that machine learning is multimodal, or deep learning is multimodal. What do you mean by that?
Rob Trangucci: 00:45:32
So, I mean that when you look at the, let's say like the training error or something as a function of the parameter values of your sort of neural network weights, there are many points in the parameter space that minimize the training error or the validation error,
Jon Krohn: 00:45:56
Right.
Rob Trangucci: 00:45:58
And they're sort of pretty much equivalent in terms of the value of the objective. And so it's hard to sample from a multimodal distribution, it's hard to optimize a multimodal distribution because one of the things that we talked about earlier is gradients, right?
Jon Krohn: 00:46:20
Mm-hmm (affirmative).
Rob Trangucci: 00:46:20
So gradients are local information. You just know at the point that I'm at right now in the parameter space, how does the objective function change locally? If you remember your calculus course, it's dy/dx, and those are infinitesimal values, but you sort of want global information. You want to know what is the global maximum? And so if you just picture like a sort of a distribution of two humps, or something, you can use an algorithm to find one of the modes, the mode is one of these peaks. And, your algorithm will stop there.
Jon Krohn: 00:47:04
Multimodal. Ah.
Rob Trangucci: 00:47:06
Yes. So your algorithm is going to run until it finds that the gradient is about zero, and that'll indicate that you've reached the maximum and will terminate. And you'll never know that there was this other mode that you missed, and it's this other set of parameter values that maybe leads to a lower validation error, training error, et cetera. And I'm not up on the latest deep learning research. My understanding is that these are multi-multi, multi-modal objective functions and that most modes are about the same. It doesn't seem to matter too much.
Jon Krohn: 00:48:01
Yeah. There are just so many parameters in a typical deep learning model that every time you initialize and then randomly sample from your dataset to train the model with stochastic gradient descent, which stochastic just means that we're randomly sampling, [inaudible 00:48:18] once, every single time you're going to end up with a different set of parameter values that solves the problem to roughly the same extent. You get roughly the same cost on your validation dataset, you minimize it to the same level. So like you're saying, there are, infinite probably isn't the right word, maybe approaching an infinite number of ways that this million-parameter neural net can solve a problem. And that's kind of what you're getting to with this multimodal idea, the mode just being a peak of a distribution. So if you think of a distribution with one peak, then the mode is a good estimate of the average alongside your mean or your median, but a distribution can have multiple modes, it doesn't have to have just one peak. And the techniques that we use, the gradient descent technique that predominates in machine learning, including in deep learning, can get trapped in one mode arbitrarily. There are kind of ways around this, and there are ways that you can retrain your model again. Basically, because of what you're describing, with it being a massively multimodal space, you can end up converging on a good set of parameter values despite the presence of all these different modes.
Rob Trangucci: 00:50:08
Yep.
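A tiny, illustrative sketch of the mode-trapping behavior discussed above: the one-dimensional function (w^2 - 1)^2 has two equally good minima, and plain gradient descent lands in whichever one the random initialization happens to favor. The function and settings are made up for the example.

```python
import numpy as np

# Multimodal objective: f(w) = (w**2 - 1)**2 has two minima, at w = -1 and w = +1.
# Gradient descent, which only sees local slope information, converges to
# whichever "mode" the random starting point happens to sit nearest.
def grad(w):
    return 4 * w * (w ** 2 - 1)   # derivative of (w**2 - 1)**2

rng = np.random.default_rng(0)
for trial in range(5):
    w = rng.normal(scale=2.0)      # random starting point
    for _ in range(500):
        w -= 0.01 * grad(w)        # local, gradient-only updates
    print(round(w, 3))             # roughly -1.0 or 1.0, depending on the start
```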
Jon Krohn: 00:50:12
We've gone really deep. Unless you have something else specific to say about that. [crosstalk 00:50:18]
Rob Trangucci: 00:50:18
So I'm going to say that in one or two dimensions, the mode can be sort of a good description of where the probability mass lies, but when you get into higher dimensions, that's sometimes not the case. Like if you look at, and again I'm parroting Michael Betancourt, who has written a lot of great stuff about sampling from high-dimensional probability distributions, but if you think about a normal distribution in many dimensions,
Jon Krohn: 00:50:56
Rob. How many dimensions can you visualize in?
Rob Trangucci: 00:51:02
Three. We're talking more than I'm able to actually visualize. I'll just sort of use a term that I don't like so much, but it turns out that most of the mass of the distribution, and when I'm talking about mass I'm talking about the integral over the density, the normal density times this differential volume element. And so these two quantities, the density and the differential volume element, sort of play against each other. And so, if you think about a normal [inaudible 00:51:44], the distribution's peaked, like the density is peaked around zero, and that's true as you increase dimensions, but the volume grows much faster than the density decays. And so it turns out that most of the probability mass of a multi-dimensional normal distribution is sort of in a shell, like a thin shell.
Jon Krohn: 00:52:08
Whoa. Wow, wow.
Rob Trangucci: 00:52:09
And so the mode can be a bad description of a distribution, because that's just dealing with the density. So as Bayesians, we're interested in computing expectations of functions of the parameter values with respect to the posterior. What's the posterior mean of a parameter, what's the posterior standard deviation of the parameter? And so these expectations are integrals across the entire parameter space against this density. And so we need to quantify areas of high probability mass, which necessitates exploring all of the probability distribution, ideally focusing on areas of high mass, because those are going to be the biggest contributors to your expectation values. Again, this is really nothing new. This is a Michael Betancourt special, and I'll send a paper to you, Jon.
Jon Krohn: 00:53:22
Perfect.
Rob Trangucci: 00:53:22
That maybe the,
Jon Krohn: 00:53:24
Yeah, we'll put it in the show notes.
Rob Trangucci: 00:53:25
People can check out, yeah.
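The "thin shell" behavior Rob describes is easy to check numerically. The following illustrative snippet (not from the episode or the paper) samples from standard normal distributions of increasing dimension and looks at how far the draws sit from the origin, where the density is highest.

```python
import numpy as np

# Numerical illustration of the "thin shell": for a d-dimensional standard
# normal, the density peaks at the origin, but nearly all of the probability
# mass sits at a distance of roughly sqrt(d) from it.
rng = np.random.default_rng(1)
for d in (1, 10, 100, 1000):
    samples = rng.standard_normal(size=(10_000, d))
    radii = np.linalg.norm(samples, axis=1)       # distance of each draw from the origin
    print(d, round(radii.mean(), 2), round(radii.std(), 2))
# As d grows, the mean radius grows like sqrt(d) while the spread stays roughly
# constant, so the mass concentrates in a thin shell far from the mode.
```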
Jon Krohn: 00:53:27
Yeah. Perfect, amazing Rob. So when I'm talking to you about topics like this, it's really humbling because it shows me how much more I have to learn, but that was a beautiful description of what we're trying to achieve with Bayesian statistics. So let's talk specifically now about the Stan package. So first, does Stan stand for anything or it's kind of got, statistics is the first three letters, that's kind of how I think about it. Do you know anything about why it's named Stan?
Rob Trangucci: 00:54:02
Yeah. It's not an acronym; it's named after a physicist who worked at Los Alamos, who sort of invented the Monte Carlo method I mentioned earlier. That it's, [inaudible 00:54:23], we want to take expectations, [inaudible 00:54:25] integrals. We often don't know the probability distribution because we have some arbitrary prior, arbitrary posterior. And so in order to compute these integrals, we need to use the Monte Carlo estimator for these integrals. And so it's named for Stanislaw Ulam, [foreign language 00:54:43], so Stan is just the first part of his name. Yeah.
Jon Krohn: 00:54:51
Cool. So we talked about Monte Carlo very briefly in episode 499 with Barr Moses, because their company is called Monte Carlo, but I barely went into any detail about it, because that company doesn't actually have anything specifically to do with Monte Carlo methods. So maybe you could tell us just a couple sentences about this idea. I know the name Monte Carlo comes from the idea of a casino, like roulette tables or rolling the dice, we're getting these kind of random probabilities, but maybe add a bit more color for us.
Rob Trangucci: 00:55:33
Yeah. So let's say you have an integral you want to compute, and you have some distribution that has a density function. To get the expected value for a random variable that's distributed according to this distribution, you have to do an integral. And we can't do most integrals, they're just very complicated. But if we can sample from that distribution, we can use just the simple average of samples from that distribution to get an estimate for the integral that we really want. And if we want the expectation of a function of this random variable, we just sample from the distribution, take the function value of each draw, and then just take the average of that. And so that's sort of a law of large numbers result: as the number of samples that you get goes to infinity, your estimate, this sample average, will converge to the expected value. And yeah, it sort of undergirds most modern methods in Bayesian statistics for doing sort of exact inference. There are a lot of approximate inference methods, like variational Bayesian inference, expectation propagation. Those are different methods that sort of take the sampling problem and turn it into an optimization problem. But you trade away some fidelity in that trade-off.
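Here is a minimal, illustrative version of the Monte Carlo estimator Rob describes, approximating E[X^2] for a standard normal (true value 1.0) by averaging over samples; the target function and sample sizes are arbitrary choices for the sketch.

```python
import numpy as np

# Monte Carlo estimation of an expectation: approximate E[f(X)] by averaging
# f over draws from X's distribution, here f(x) = x**2 with X ~ N(0, 1).
rng = np.random.default_rng(2)

for n in (100, 10_000, 1_000_000):
    x = rng.standard_normal(n)          # draws from a standard normal
    estimate = np.mean(x ** 2)          # sample average of f over the draws
    print(n, round(estimate, 4))        # converges to the true value, 1.0, as n grows
```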
Jon Krohn: 00:58:03
So when I first started learning about Bayesian statistics at the beginning of my PhD, so this was 2007, 2008, the only software library that anybody mentioned to me for doing Bayesian inference was BUGS. And BUGS only ran on Windows as far as I can remember at that time, and it was called WinBUGS. And so I had a Mac, as I've had for a very long time, a Unix-based operating system, and I had to dual-boot Windows on this box to get WinBUGS running. Have you ever used WinBUGS?
Rob Trangucci: 00:58:52
I have not. I think I've used JAGS, which is a non-Windows-native implementation of BUGS. BUGS stands for Bayesian inference Using Gibbs Sampling, and JAGS is Just Another Gibbs Sampler. These were tools that allowed people to do inference on models that they hadn't been able to do inference on before, but they sort of exhibit some pathologies for certain problems that are very common, for instance, hierarchical models. I sort of mentioned that as the number of samples grows to infinity, you get the sort of right answer with your Monte Carlo estimator for the expected value. And the question is, "How many samples do you really need to accurately quantify this expected value?" And it turns out that for some problems, with the Gibbs sampler, you just won't ever get enough samples. We can't run computers for infinite amounts of time. We have to stop them at some point. We're always getting approximations to this expectation value that we're looking for. So a Gibbs sampler sometimes would need to be run for millions and millions and millions of iterations and still not be very close to getting good answers.
Jon Krohn: 01:00:36
Ah, cool. And so whether we're talking about BUGS or JAGS, the GS at the end of those acronyms is this Gibbs sampler. And so I think you mentioned earlier in this episode an alternate approach to Gibbs sampling, which you said is Hamiltonian, right? And so that's kind of what the basis of Stan is. It's going to be unavoidable to get into some level of technical detail, but this Hamiltonian sampler allows us to converge with more reasonable sample sizes than infinite or millions, so we get an expected value that approximates what it would be if we could sample infinitely. Tell us about that. Tell us about Hamiltonian sampling in general. Also, that reminds me, I think, is that related to this idea of a No-U-Turn sampler?
Rob Trangucci: 01:01:41
Yeah. So that Hamiltonian Monte Carlo was, I think, invented in molecular dynamics for generating samples of molecular dynamics to quantify-
Jon Krohn: 01:01:59
-because they follow a Hamiltonian process. It's something like if you follow how a speck of dust moves on the top of the surface of water, it follows Hamiltonian motion. It's a particular kind of randomness. Right?
Rob Trangucci: 01:02:13
Yes. I think so, that sounds right. So Hamiltonian Monte Carlo became much more well known to the stats world because Radford Neal at U Toronto wrote a paper about it. I think he was using it in his thesis about neural networks, to sample from the posterior for these neural networks. So Hamiltonian Monte Carlo is a way of using gradient information to help you explore this posterior distribution better; you have to find areas of high mass, and it turns out that Hamiltonian Monte Carlo can do that. And there are papers about it, again, the stuff that Michael Betancourt has written is really good for understanding that sort of thing. But it turns out that Hamiltonian Monte Carlo has a lot of tuning parameters, and they were hard to set. A lot of hand tuning went into Hamiltonian Monte Carlo. Matt Hoffman and Andrew Gelman at Columbia wrote this paper called the No-U-Turn Sampler, which was sort of an adaptive variant of Hamiltonian Monte Carlo and included two important contributions.
Rob Trangucci: 01:03:47
Without getting into too much detail, Hamiltonian Monte Carlo involves approximately solving differential equations, so you're doing numerical integration. A question you have to answer when you're doing numerical integration is, "How big is your discretization step, and how many steps do you run to essentially solve your differential equations?" The No-U-Turn sampler paper did two things. It adaptively sets the discretization step size in what's called warmup, the burn-in phase in Gibbs sampling terms. And then it adaptively sets the number of integration steps: it essentially waits until the sampler is about to do a U-turn, then it terminates and samples uniformly from the path that has been generated. It turns out that this is a very robust way to sample distributions, and it really opened up a whole world of models that were not fittable with Bayesian stats before. We're still learning the limits of this algorithm. It's very powerful, and that's Stan's bread and butter: it implements the No-U-Turn sampler, which is NUTS for short. Stan's implementation has changed over the years to be more robust and faster. But it's great. I use it a lot in my research; my applied research, a lot of it is using Stan, and I think a lot of researchers and a lot of companies use it.
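For readers who want to see what calling Stan's NUTS sampler looks like in practice, here is a minimal sketch using the rstan interface; the toy normal model, the simulated data, and all the settings are illustrative assumptions, not anything from the episode. Stan runs NUTS by default, with the step size adapted during the warmup iterations.

```r
library(rstan)

# A toy Stan model: estimate the mean and sd of a normal distribution
model_code <- "
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 10);     // weakly informative priors
  sigma ~ normal(0, 5);
  y ~ normal(mu, sigma);
}
"

# Simulated data; step size is adapted automatically during warmup,
# and the number of integration steps per iteration is set by NUTS
fit <- stan(model_code = model_code,
            data = list(N = 50, y = rnorm(50, 2, 1)),
            chains = 4, iter = 2000, warmup = 1000)
print(fit)
```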
Jon Krohn: 01:06:15
Nice, so let's jump to that next. I want to talk about applications, your work in particular, and when we might want to use Bayesian stats, so let's get to that. But first, one quick thing just to sum up this idea of NUTS sampling, No-U-Turn sampling. I guess with other Hamiltonian Monte Carlo approaches, or certainly with Gibbs sampling, maybe this is part of why Gibbs sampling ends up needing effectively infinite samples, or can't converge at any reasonable size, because you are making all these U-turns. You make progress in one direction and then it just turns around and starts going the other way. Instead of converging on the parameter value that is suggested by your data, you're just heading off in the wrong direction all the time.
Rob Trangucci: 01:07:09
Yeah. You're random walking. That's sort of the death knell of any Monte Carlo sampler, random walk behavior. If you can avoid that and sample with purpose, then you're good. HMC and NUTS are good at that, and Gibbs sampling, in higher dimensions, can become a random walk, which is bad.
Jon Krohn: 01:07:37
Cool. All right. So we talked about WinBUGS and JAGS, which you probably don't want to use given this conversation. Stan is obviously a great package if you want to be doing Bayesian stats. It's going to be efficient because it's got C++ under the hood, but you can call it from R or Python or the command line, so it's convenient for a lot of the software languages that data scientists are familiar with. Another library I know of is PyMC3. Are there any other libraries that you feel deserve an honorable mention? A Bayesian stats library people should be looking at?
Rob Trangucci: 01:08:18
Yeah. Stan is wonderful, but going straight into writing your own Stan code is like jumping into the deep end, and that can be a little scary at first. A good intro is using a package like BRMS or something like RStanArm. Both of these allow the user to use a more common model specification language, like the linear model specification language in R or the lme4 hierarchical, or multilevel, model specification. That can be a good introduction: you essentially write a model the way you might with another package in R, but still get the benefits of sampling with Stan.
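As a rough illustration of that gentler entry point, here is a hedged sketch of fitting a varying-intercept model with brms using lme4-style formula syntax; the data frame and variable names (y, x, group) are placeholders invented for this example, and the near-equivalent rstanarm call is noted in a comment.

```r
library(brms)

# Simulated toy data: observations nested in groups
set.seed(5)
df <- data.frame(
  y     = rnorm(200),
  x     = rnorm(200),
  group = rep(letters[1:10], each = 20)
)

# Varying-intercept model in lme4-style formula syntax, sampled with Stan's NUTS
fit <- brm(y ~ x + (1 | group), data = df, family = gaussian(),
           chains = 4, iter = 2000)
summary(fit)

# The rstanarm near-equivalent would be:
# library(rstanarm)
# fit <- stan_glmer(y ~ x + (1 | group), data = df, family = gaussian())
```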
Jon Krohn: 01:09:29
Wow. Okay. Amazing. So Rob, you have given us an introduction to Bayesian stats. You have left us with libraries that we can start getting going with Bayesian stats on our own, like BRMS or RStanArm. So now tell us about applications of Bayesian stats, things we can be doing with Bayesian stats that we couldn't be doing with frequentist stats or machine learning, and maybe a good place to start with that is your own PhD research. We haven't even talked about this on the show, but you're doing a PhD, and you're most of the way through it, maybe, at the University of Michigan. I know that since the pandemic, you have had some research projects focused specifically on COVID, so that could be a particularly cool use case to talk about.
Rob Trangucci: 01:10:26
Yeah. So back in April, when COVID was unfolding-
Jon Krohn: 01:10:38
-April 2020?
Rob Trangucci: 01:10:40
2020. Yeah. I'm part of a research group at the University of Michigan called Epibayes, short for Bayesian epidemiology. The group is led by Jon Zelner in the School of Public Health. Jon is an epidemiologist who does a lot of work using Bayesian inference in epidemiology. Early on, this group that I'm a part of was working with the state of Michigan to help understand some of the trends in the COVID caseload and specifically the spatial distribution of COVID cases in Michigan.
Jon Krohn: 01:11:39
Spatial distribution, so we're not talking about a high-dimensional probability distribution. You mean where in the state.
Rob Trangucci: 01:11:48
Yeah, where in the state, yes.
Rob Trangucci: 01:11:50
Is it in Detroit? Or is it in Grand Rapids? We started getting line-level case data from the state of Michigan. I was involved in writing some of the R code. This is very low level, just data processing: cleaning the data, binning by census tract and public use microdata areas, stuff like that. We essentially put together this website called covidmapping.org, which shows an up-to-date map of the state of Michigan and the rates of disease across the state. At some point in our investigation of the data, we noticed that race and ethnicity were missing in a lot of COVID cases. This is not specific to Michigan; it's a problem across the U.S. I think there was an Atlantic article about how race and ethnicity data were often incomplete for COVID cases. There are natural questions that people want to ask, like: are there disparities in COVID-19 incidence by race and ethnicity? That is stymied by missing race data, because if it turns out that the rate of missing data is higher for one race compared to another, then when you take the ratio of the two rates of disease, you'll get the wrong answer.
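A toy calculation, with made-up numbers, shows why differential missingness in race data distorts the comparison: the two groups below have identical true case rates, but because group A's cases are missing race more often, the observed rate ratio comes out well below one.

```r
# Toy illustration of bias from differential missingness (all numbers invented)
pop_a <- 100000; pop_b <- 100000   # equal population sizes
cases_a <- 500;  cases_b <- 500    # equal true case counts, so true rate ratio = 1

p_missing_a <- 0.4                 # race is missing more often for group A
p_missing_b <- 0.1                 # than for group B

observed_a <- cases_a * (1 - p_missing_a)   # cases actually attributed to group A
observed_b <- cases_b * (1 - p_missing_b)   # cases actually attributed to group B

observed_rate_ratio <- (observed_a / pop_a) / (observed_b / pop_b)
observed_rate_ratio                # about 0.67, not the true value of 1
```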
Rob Trangucci: 01:14:08
It's one of these very tough missing data problems where we have reasons to suspect that the probability that an observation is missing race might depend on the value of the missing data point itself, the race of the COVID-19 case patient, for a few different reasons. The missingness is driven partly by non-response when people get a COVID test; if you've gotten a COVID test, you will know that they ask for your race when you fill things out. It can also be missing due to mistakes in data handling after a case is recorded in the lab and sent to the state as part of its surveillance program. The research question became, "How do you accurately quantify race and ethnicity disparities in COVID-19 incidence with missing race data?" I should say, this is just one of many potential biases in the data. Something that we don't touch and won't touch is testing rates; we're just working with PCR-confirmed COVID-19 cases. We don't have a way of correcting for testing bias with this sort of case data, though there are efforts within the group to help understand that bias a little better. Essentially, we have this missing data problem, and it's the missing data problem of the worst kind that-
Jon Krohn: 01:16:21
-you were saying that dealing with missing data has been something that has been a focus of yours for a long time now.
Rob Trangucci: 01:16:28
Yeah. I mean, really, Ben Goodrich, shout out to Ben Goodrich, taking this class at Columbia really changed the way that I thought about missing data. The challenge was to come up with a method to help deal with this missingness. We have a model where it turns out that, if you use census data, like population data from the census, you can learn these two different quantities: the rate of disease by race and the proportion of missingness by race, that is, the proportion of missing race data by race. The reason that we end up using Bayesian inference in this problem is that we're trying to describe both the missing data process, the process that leads to the missing data, and the disease process. It becomes pretty high dimensional because the data have spatial information attached to them. That's how we're able to marry it with census data. We think that rates of disease are local; they're driven by local transmission. Missingness might be local, or it might be at a higher level, like a county level. The model becomes very high dimensional very quickly. We also have good prior information about some of the parameter values. All of this combines to make Bayesian inference a good way to attack the problem. Something that I didn't mention earlier, but might now, is that it is very hard to do optimization in high-dimensional parameter spaces. Sometimes, a way to help with that is to add prior information, which, in essence, is a way to regularize.
Jon Krohn: 01:19:00
Okay. I see. With some kinds of problems, like this problem where we're trying to identify the proportion of missing data for a particular race, it's a complex parameter space that we're in. By having some prior information, maybe historical rates of missing data by race or something, you can start off your algorithm, your Bayesian statistical model, with this prior information, and then the model doesn't have to search as vast a space. You have this region that you think is worth exploring, and that might lead to an optimal solution.
Rob Trangucci: 01:19:57
Yeah. Yeah. I think that's a good way to put it.
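Here is a small, self-contained sketch of that regularization idea; it is a conjugate normal-normal toy example, not the Epibayes model, and every value in it is invented. Many area-level quantities are each estimated from only a handful of observations, and a prior shrinks the noisy raw estimates toward the prior mean, typically reducing their error.

```r
# A normal prior acting as regularization when estimating many
# area-level quantities from sparse data (values are made up)
set.seed(2)
true_rates <- rnorm(50, mean = 0, sd = 0.5)   # 50 areas, true values
n_per_area <- 5
y <- sapply(true_rates, function(m) rnorm(n_per_area, m, 2))

raw_means <- colMeans(y)                       # no prior: noisy estimates

# Posterior mean under a normal(0, 0.5) prior with known data sd = 2
# (conjugate normal-normal shrinkage toward the prior mean of 0)
prior_var <- 0.5^2
data_var  <- 2^2 / n_per_area
shrink    <- prior_var / (prior_var + data_var)
post_means <- shrink * raw_means

# The shrunk estimates typically have smaller error on average
mean((raw_means  - true_rates)^2)
mean((post_means - true_rates)^2)
```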
Jon Krohn: 01:20:00
Yeah. Nice. Much better than my machine learning Bayesian stats parallel.
Rob Trangucci: 01:20:10
Yeah. I'm working on a paper right now that we will submit soon with my advisor at Michigan, Yan Chen.
Jon Krohn: 01:20:24
Oh, I'm laughing because I was thinking of what happened right before the episode. Right before we started recording, so that you're aware, listeners, I always ask guests if there's anything they'd like me to be sure to mention. Guests often have a book that just came out, or there's some hiring they're doing, or a particular kind of client they're looking for. And for Rob, because he's a PhD student, what we need you to do is cite Rob's papers.
Rob Trangucci: 01:20:52
That would be great. Yeah.
Jon Krohn: 01:20:56
All of your academics out there get reading Rob's papers and citing Rob's papers.
Rob Trangucci: 01:21:02
I appreciate the shout out. The other author on there is Jon Zelner. We'll submit it soon. Maybe it gets accepted and if it does, you'll hopefully have something to share with everybody.
Jon Krohn: 01:21:21
Yeah. We need all of our listeners who are reviewers at prominent Bayesian statistical journals to also approve all of Rob's submissions with no errors.
Rob Trangucci: 01:21:36
I can't imagine that would be a problem.
Jon Krohn: 01:21:39
I'm sure that happens all the time. All right. That's a great example of a project that we can tackle really well with Bayesian stats. I love how it has a social impact as well. That's a nice thing. What is it like, day-to-day, to be doing a PhD in stats? We probably have listeners, certainly, we have listeners who have done PhDs in stats or machine learning or another quantitative discipline or some other discipline, but probably most listeners haven't. Some of them are maybe thinking, would I like to do a PhD? What's it like to be doing a PhD in stats at a top institution like the University of Michigan? What do you do day-to-day?
Rob Trangucci: 01:22:26
Well, it changes by year. The first couple years, you are taking the coursework along with your classmates, your PhD cohort mates. That is exciting. It's hard. This is not the case anymore, but it was the case that, between your first and second year, you had to take a qualifying exam. You do a year's worth of coursework and then you have a written test and a take-home applied data analysis problem, and you have to pass both of them to continue in the program. I'm very happy that the current students don't have to go through that, because it was very stressful to go into a situation knowing that this is existential. It's tough.
Jon Krohn: 01:23:35
Yeah. So I'm going to interrupt you and then I'll let you continue on explaining what the program's like. The graduate program that I did was what they call a one-plus-three program. It was this neuroscience graduate program at Oxford, and the first year was a master's, and we actually had, similar to the kind of thing that you have in the U.S., classwork. We had two research projects, but the vast majority of our time was spent doing classwork, which is a bit unusual in the European system, unlike the North American system. In the European system, you typically go right from an undergrad, which is usually a three-year undergrad, into what is often just a three-year PhD without this year to get familiar with the space. You kind of have to come from the discipline, in a way: if you want to do a PhD in stats, you probably have to do a bachelor's in stats.
Jon Krohn: 01:24:33
Some European universities are starting to roll out these kinds of programs, and mine was one. The first year was this master's, and there were 15 of us doing the master's in neuroscience, and five of us were on the one-plus-three program, which meant that we continued on with the PhD afterward. But we did have to pass the master's, and it also had a written test. It's kind of funny, because I think we all had the impression that it was really easy going into it. Maybe some people stressed out about it, but I didn't, and I don't think a lot of people did. All you had to do was pass it; it didn't matter how, as long as you passed, and I think everyone got through. It's interesting that different universities, obviously, can do things their own way, but at Michigan it was this harrowing experience, whereas mine was, eh.
Rob Trangucci: 01:25:34
I think for some people, it was not. The backgrounds of people coming into the PhD program varied.
Jon Krohn: 01:25:42
That's a really, really good point. And so, if I do have a listener who is one of those 15 people that did the master's in neuroscience with me, I owe you an apology, because actually that is exactly it. My undergrad was actually in the area, so I think that probably also goes a long way.
Rob Trangucci: 01:26:07
The first couple years were tough because I had gone from industry. I was working on the Stan programming language, and I worked at a FinTech startup. Yes.
Jon Krohn: 01:26:22
Yes. I'm going to interrupt you and give the audience more background on your background, because this is interesting in and of itself.
Rob Trangucci: 01:26:27
Background on the background.
Jon Krohn: 01:26:29
So we're going to get back into this PhD and what it's like. You first did a physics bachelor's degree at Bucknell. And then you did consulting for three years at a place called Photon. And then you did a master's at Columbia in Quantitative Methods in the Social Sciences.
Rob Trangucci: 01:27:03
I mean, it's like, it just rolls off the tongue. I can't believe you didn't get it right the first time.
Jon Krohn: 01:27:04
The quant methods course you did. And then after that you worked at Ascendum for a year, which was a tech startup in the data science space, a financial data science technology kind of thing, which you can go into more detail on. Then you worked with Andrew Gelman at Columbia University, but not as an academic, as a professional statistician. And then again, you went back to academia. So you did your undergrad, three years doing consulting, then a master's, and then several years working as a data scientist or statistician, and then decided to go back and do a PhD. So yeah, what's going on there? Why would you do that?
Rob Trangucci: 01:28:00
Yeah. The first time I went back to school, honestly, I just missed physics. I missed the math, and I was faced with all these really interesting questions. Most of my drive to go back to academia is driven by problems that I faced in industry that just don't have good solutions. And so I want to go study them. Sometimes, depending on where you are, you just don't have the time; there's no time to really spend thinking through the best approach, because you need a solution fast. So at Photon I was doing a lot of industry forecasting, and I got really interested in microeconomics. Then when I went to Columbia, I ended up taking a stats course. I loved it, and I just went all the way into stats. And then again, when I was in industry at the FinTech startup, and also doing a bit of consulting as well, there were problems that everyone faces that just don't have great solutions. One of which is the all-important prior in Bayesian stats. How do you think about the impact that that prior has on your inferences? How do your assumptions impact the conclusions you draw? That's an important thing to be able to quantify, and it's a tough problem. It's something that I've done a little bit of research on.
Rob Trangucci: 01:29:52
And so I was driven by the desire to help build tools to help people understand their models better, to think about, okay, for certain classes of models, what's a good prior to use for this parameter. At the time, I was really interested in Gaussian processes. There's a way to think about neural nets as a finite-dimensional approximation to a Gaussian process, which is this infinite-dimensional probability distribution. And so, as we were talking about earlier, these neural networks are multimodal, and Gaussian processes can suffer from the same issues; they're multimodal in their hyperparameters. If you just think about a one-dimensional function, an unknown function you want to learn about: sometimes you'll say, okay, it's linear, and then you'll use linear regression to learn the function. Other times you don't know very much about it, but you think it's smooth, or you think it wiggles a lot. These are things that you think you might know about the function. Gaussian processes are ways to put priors over functions, to learn about the function, and notions like wiggliness and magnitude, how much does the function oscillate, how big is it from peak to trough, stuff like that, are hyperparameters that control the prior. The posterior for the Gaussian process can be multimodal in these two hyperparameters, the magnitude and what's called the length scale, which you can think of as a sort of wavelength. And so I wanted to think about how you put priors on the hyperparameters, because we know it's good to integrate over all the sources of uncertainty that might come into a problem. So that was the drive: there's this hard problem, and there's not a lot of great research about it.
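To make the hyperparameters concrete, here is a rough sketch of drawing functions from a Gaussian process prior with a squared exponential kernel; the magnitude and length scale values are purely illustrative assumptions. A short length scale gives wiggly functions, a long one gives smooth functions.

```r
# Draws from a Gaussian process prior with a squared exponential kernel
set.seed(3)
x <- seq(0, 1, length.out = 100)

sq_exp_cov <- function(x, magnitude, length_scale) {
  d <- outer(x, x, "-")
  magnitude^2 * exp(-d^2 / (2 * length_scale^2))
}

nugget <- diag(1e-6, length(x))  # small diagonal term for numerical stability
K_wiggly <- sq_exp_cov(x, magnitude = 1, length_scale = 0.05) + nugget
K_smooth <- sq_exp_cov(x, magnitude = 1, length_scale = 0.5)  + nugget

# Sample one function from each prior via the Cholesky factor
f_wiggly <- as.vector(t(chol(K_wiggly)) %*% rnorm(length(x)))
f_smooth <- as.vector(t(chol(K_smooth)) %*% rnorm(length(x)))

plot(x, f_wiggly, type = "l", ylim = range(f_wiggly, f_smooth), ylab = "f(x)")
lines(x, f_smooth, lty = 2)  # the longer length scale gives the smoother draw
```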
Jon Krohn: 01:32:29
So yeah. So essentially the reason why you've gone back to academia both times was because there are these fascinating mathematical topics that you really wanted to dig into in detail. And I think that's a beautiful reason to go back and do a PhD. So do you think now that you will stay in academia after the PhD, or that you'll go back to industry? Or maybe that's not a great, I don't know if you...
Rob Trangucci: 01:32:57
I'm thinking about both options. I don't know. There are great things about academia. I like teaching, though I hesitate to say that's something you can't do in industry, because you do a lot of teaching there, and it sort of just dawned on me that that's an option. I like applied problems. Something that I did before I went to grad school, and enjoyed a lot, was working with a big publisher that needed to figure out how to price its ebook back catalog. We came up with a way to do that, and at its core, it's a decision problem. I think in industry, the problems that you're faced with are often of the decision sort: what do I do with this data? Here's a decision I need to make, I have data, how does that data inform my decision? So I like that view a lot, and I think sometimes in academia you're not doing as much of that sort of decision-making; it's more about the inference. So yeah, that's a long way of saying, I don't know, but I think both paths could be very interesting for me.
Jon Krohn: 01:34:38
Cool. Yeah, no doubt. And for sure, I think there's more and more opportunity in disciplines like data science broadly, including being a statistician. I think there is a lot of opportunity to go into industry and still teach and maybe even do research, and certainly the big tech companies do a huge amount of academic research. So there are increasingly career options that blend both, and you don't really have to choose. And hopefully that blending ends up being the best of both worlds instead of the worst of both.
Rob Trangucci: 01:35:20
Right.
Jon Krohn: 01:35:20
I think that's the idea.
Rob Trangucci: 01:35:25
Yeah.
Jon Krohn: 01:35:26
And yeah, I think it can work. Okay. So we've digressed; I ended up getting you to talk about the post-PhD stuff, and we didn't even finish explaining what it's like to do a PhD. We'd gotten through the early years. Even if Michigan now doesn't have that end-of-first-year exam, I think that kind of thing is still typical at a lot of universities. But then, I guess after that, you go into one particular lab and you start to really dig deep into your particular research at that point?
Rob Trangucci: 01:35:57
Yeah. Usually in the summer between your first and second year, you do research. You pick a professor in your department that you think would be fun to do research with, and you spend a summer doing research, and that gives you a taste for what they do, how they manage things. A big piece of it is, do you work well with this other person? After the summer, if you enjoy working with them, you'll often continue the research through your second year intermittently, between coursework or on top of coursework, if you can. Then in your third and fourth years, you typically do something called a prelim exam, which is like an oral exam. You write a paper, you give a presentation; it's sometimes thought of as a dissertation proposal, though it doesn't always function like that. It can often just be: here's a project that you spent a lot of time on, and you present your findings. Almost everybody passes those, because your advisor doesn't let you go up for it if you're not ready. Then in your third, fourth, and fifth years, you're doing research. It's typical for stats PhD students to do three separate projects that you put together into a dissertation with a somewhat common thread. It can also be the opposite, where you have a dissertation on one topic, but mine will be three.
Jon Krohn: 01:38:06
Yeah. I think that's more common for sure.
Rob Trangucci: 01:38:10
Yeah. And then my day-to-day now is spent writing a lot of code and running simulation studies. That's where you simulate data from some scenario, you fit your model to these datasets, and you see how your model performs on a bunch of different criteria, like mean squared error and coverage: whether or not your posterior intervals, or your confidence intervals if you're a frequentist, cover the true values that generated the data. I also do a lot of working through math problems related to my work, and then writing; maybe the last two weeks have been very heavy on just writing. That has actually been something that I had heard about, and it hasn't really dawned on me until now how important the writing process is. Obviously you're in a PhD to write your dissertation, but the writing process itself can help clarify and generate new ideas. There's something really clarifying about trying to present your work to other people that is unique to writing. I do a lot of presentations, and writing is a different thing compared to that, but I've been surprised at how much I enjoy it. So yeah.
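As a generic illustration of that workflow, and not Rob's actual models, here is a minimal simulation study in R: simulate data from a known scenario, fit a model, and track mean squared error and interval coverage across repetitions. A plain linear model stands in for the real model, and all values are invented.

```r
# Minimal simulation study: MSE and 95% interval coverage for a slope estimate
set.seed(4)
true_beta <- 2
n_sims <- 500

results <- t(replicate(n_sims, {
  x <- rnorm(100)
  y <- true_beta * x + rnorm(100)
  fit <- lm(y ~ x)
  est <- coef(fit)[["x"]]
  ci  <- confint(fit)["x", ]
  c(error_sq = (est - true_beta)^2,
    covered  = as.numeric(ci[[1]] <= true_beta & true_beta <= ci[[2]]))
}))

mean(results[, "error_sq"])   # mean squared error of the estimator
mean(results[, "covered"])    # coverage of the 95% interval (should be near 0.95)
```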
Jon Krohn: 01:40:13
It's good you enjoy it, or else, if you didn't, it would be the most harrowing part of a PhD.
Rob Trangucci: 01:40:19
It's part of that. Yeah.
Jon Krohn: 01:40:22
But totally, a hundred percent. There's something about being able to write something, and I don't know if I've ever articulated this before, and I don't know if I will do it particularly well right now off the cuff, but when you write a book or a thesis or a paper, this document has to capture a snapshot of everything at once. If we have this conversation, there are opportunities for interjection and going off in particular places, and it doesn't have to be this one solid contiguous piece of a big concept.
Rob Trangucci: 01:41:11
Yeah.
Jon Krohn: 01:41:12
When you write a paper or a book or a dissertation, from beginning to end it has to have this one particular flow. You get to page 300 and it says, refer back to page one. The whole thing, those hundreds of pages, all have to work together contiguously.
Rob Trangucci: 01:41:38
Yep.
Jon Krohn: 01:41:40
And that is a really challenging thing about it. But as it all starts to come together, it really is a beautiful thing. And it certainly does lead to more research ideas and where the gaps are in what you know, and so on.
Rob Trangucci: 01:41:56
Yeah. I think that's a good way to put it.
Jon Krohn: 01:42:00
Got it. All right. So that's awesome, hearing what you're doing day to day: writing code, simulating data, math problems, teaching, writing papers. It sounds like a great thing to be doing for somebody who wants to dig deep into a topic and is now getting to do it. So, all right. Here's a question that I don't ask a lot of guests, but it's actually one of my favorites to ask. You've been doing this PhD, you have industry experience, so you're aware, intimately, probably a lot more so than many academics at any stage, of how the research that you do discovers new things. Things come along, like the No-U-Turn sampler that we talked about, that allow problems to be solved with data that just couldn't be solved before; deep learning is maybe another example in recent years. These data modeling innovations happen in the context of these continuous exponential changes all around us: ever cheaper data storage, ever cheaper compute, ever more abundant sensors all over the place sensing all different kinds of things and collecting new kinds of data, interconnectivity, the speed of the connectivity between us, and the ability to share, through papers on arXiv, code on GitHub, virtual conferences, in-person conferences, this constant sharing of ideas. Because of all of these factors, technology advances at an exponentially faster pace each year. So is there anything in particular that excites you about the future, for you or your offspring, or some vision that you see unfolding?
Rob Trangucci: 01:44:17
Yeah. I think, and this is something that we talk about on the Stan project a lot, that as data sources and computing power grow, so too must the models that we use to understand the world. And I would say I'm excited by the prospect of these ever-growing, I'll say hybrid, models, where all this data collection that's going on incurs extra uncertainty, and often there are latent parameters that are joined together in complicated ways. I think we're just at the beginning of understanding the potential for models like that, and also the ways that we understand those models. I think there's a lot of interesting work to be done on both sides: the putting together of the models and inference on them, and understanding the operating characteristics of these models. We talked a little bit about the COVID-19 research that I'm doing. The core of the model was pretty simple, but it really could be a piece of a much larger model. And so I think as we get better at understanding how to put together these sorts of tools, we should have a more coherent understanding of the world and maybe be able to make better decisions about it. I sort of think that's the key: we want to make decisions. So...
Jon Krohn: 01:46:39
Super cool answer, Rob, but I'm afraid it was the wrong one. We were looking for flying cars. Flying cars was the correct answer.
Rob Trangucci: 01:46:48
Yeah.
Jon Krohn: 01:46:48
Tough one. You would have won a prize. So, all right. Do you have a book recommendation to leave us with, to help people be able to solve the problems of the future, as we have more sensors, more data, and build bigger, more complex models that can make bigger inferences? Bayesian statistics is going to play a big role in that; I have no doubt it's going to play a bigger and bigger role. So I'd love a Bayesian stats book recommendation, but any book recommendation you have would be greatly appreciated.
Rob Trangucci: 01:47:22
Okay. I have a few different book suggestions. The first is Statistical Rethinking by Richard McElreath, which is just a really good book on statistical modeling and Bayesian inference. I think that's a great place for people to start. Once you make your way through that, I think something like Bayesian Data Analysis, the third edition, by Gelman and others. There are a lot of authors on that book. I think it's Gelman, Rubin, Stern, Dunson, Vehtari. I think it's...
Jon Krohn: 01:48:31
I think we'll find it. We'll be able to find it [crosstalk 01:48:34] if you give us one more author.
Rob Trangucci: 01:48:40
[crosstalk 01:48:40] I think the Stan manual is very good; it is book length. You can open up the manual and look at the model classes to get a sense for how these models are written in Stan. For a non-stats book recommendation, I will go with Erik Larson; I've been reading a lot of him recently. I recently finished Dead Wake and The Splendid and the Vile, both of which I loved. They're just fun books to read. So I would recommend them to everybody.
Jon Krohn: 01:49:30
Great recommendations: Bayesian stats books, and fiction books, I guess, with Erik Larson?
Rob Trangucci: 01:49:37
Well, I think they're going to be in the history section, but they're books that make history come alive. He does very deep research about usually a single historical event, and he follows a bunch of different characters, some of whom are pretty marginal, but for whom he usually has a lot of correspondence, primary source material, and then puts them in context. So Dead Wake is about the sinking of the Lusitania, and The Splendid and the Vile is about the first year that Winston Churchill was in office and how he navigated the very beginning of World War II. They're both great books.
Jon Krohn: 01:50:32
Super cool. Yeah. I was unfamiliar with them, and I now need to apologize for suggesting they're fiction.
Rob Trangucci: 01:50:38
That's okay.
Jon Krohn: 01:50:38
All right. A final question, Rob. It's been an epic episode. I'm sure viewers now know that you are an endless spring of useful information. So how can they follow you? How can they stay up to date on your latest?
Rob Trangucci: 01:50:56
So I have a Twitter handle, I have a Google Scholar page, and my LinkedIn, I think, is a good way to find me and get in touch. [crosstalk 01:51:13] Yeah.
Jon Krohn: 01:51:16
And if people want to do a Stan workshop with you in the future.
Rob Trangucci: 01:51:20
Yeah.
Jon Krohn: 01:51:20
It sounds like the best way to do that is to subscribe to something like Jared Lander's Lander Analytics newsletter and/or his open statistical programming meetup, which is how your name came across my screen in the last few months. So, great.
Rob Trangucci: 01:51:40
Yeah.
Jon Krohn: 01:51:42
Awesome. Rob, this has been such an amazing conversation. I hope we'll have the chance to do it again sometime on the show. Thank you for taking the time and I'll catch you again soon.
Rob Trangucci: 01:51:51
You're welcome.
Jon Krohn: 01:51:52
Thank you.
Rob Trangucci: 01:51:53
Thanks for having me. It was great.
Jon Krohn: 01:51:59
Well, I'm pretty sure that was the longest SuperDataScience episode yet. Rob and I were having so much fun discussing content that I thought you might find interesting that I simply kept the filming session rolling. We covered an absolute ton today, including what Bayesian statistics is, its history, how it differs from frequentist and machine learning approaches to data modeling, and Bayesian stats' particular utility when you have prior information that could be helpful in allowing your model to find an optimal solution, such as in the Epibayes project Rob is involved with to model missing ethnicity data associated with the epidemiology of COVID-19. Rob also contrasted the older Gibbs sampler with newer, more efficient sampling algorithms like No-U-Turn sampling, or NUTS for short. Rob dug into his favorite open-source tools for performing Bayesian inference, including the Stan project that he's contributed heavily to, and the brms and rstanarm packages that he recommends for folks taking their first Bayesian steps.
Jon Krohn: 01:53:07
Rob also gave us an overview of what a PhD in stats is like, why you may want to pursue one yourself if you're passionate about challenging statistical problems, and the post-PhD possibilities that are increasingly blurring the academic and industry realms. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for Rob's LinkedIn and Twitter profiles, as well as my own social media profiles, at superdatascience.com/507. That's superdatascience.com/507. If you enjoyed this episode, I would of course greatly appreciate it if you left a review on your favorite podcasting app or on the SuperDataScience YouTube channel. I also encourage you to let me know your thoughts on this episode directly by adding me on LinkedIn or Twitter and then tagging me in a post about it.
Jon Krohn: 01:53:59
Since this is a free podcast, if you're looking for a free way to help me out, I'd be very grateful if you left a rating of my book, Deep Learning Illustrated, on Amazon or Goodreads, gave some videos on my personal Jon Krohn YouTube channel a thumbs up, or subscribed to my free, spam-free, and content-rich newsletter at jonkrohn.com. To support the SuperDataScience company that kindly funds the management, editing, and production of this podcast without any annoying third-party ads, you can create a free login to their learning platform at superdatascience.com or consider buying a usually pretty darn cheap Udemy course published by Ligency, a SuperDataScience affiliate, such as my own Mathematical Foundations of Machine Learning course. All right, thanks to Ivana, Jaime, Mario, and JP on the SuperDataScience team for managing and producing another incredible, epic episode for us today. Keep on rocking it out there, folks, and I'm looking forward to enjoying another round of the SuperDataScience podcast with you very soon.