72 minutes
SDS 681: XGBoost: The Ultimate Classifier, with Matt Harrison
Subscribe on Website, Apple Podcasts, Spotify, Stitcher Radio or TuneIn
In this technical episode, Matt Harrison returns to the podcast to discuss XGBoost, the leading machine learning library for regression and classification. As a data scientist, understanding how to fine-tune XGBoost's hyperparameters and identify its ideal modeling situations is crucial to unlocking its full power. Listen in as Matt demystifies these concepts and shares insights that will benefit the toolkits of all data scientists out there. It's one you won't want to miss!
Thanks to our Sponsors:
Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.
About Matt Harrison
Matt is a world-renowned expert on Python and Data Science. He has a CS degree from Stanford University. He is a best-selling author on Python and Data subjects. His books: Effective Pandas, Illustrated Guide to Learning Python 3, Intermediate Python, Learning the Pandas Library, and Effective PyCharm have all been best-selling books on Amazon. He just published Machine Learning Pocket Reference and Pandas Cookbook (Second Edition). He has taught courses at large companies (Netflix, NASA, Verizon, Adobe, HP, Exxon, and more), at universities (Stanford, University of Utah, BYU), as well as at small companies. He has been using Python since 2000 and has taught thousands through live training both online and in person.
Overview
If you haven't tried XGBoost yet, you're missing out! As Matt Harrison explains, XGBoost is an ensemble decision tree approach that affords extremely high classification accuracy and generalizes very well to new data. While XGBoost is the go-to library for large quantities of tabular data, there are situations where it may not be the best option. Among them, Matt highlights cases like computer vision, natural language processing and non-tabular data (unless it's been pre-processed). And if you're interested in maximizing XGBoost's efficacy even further, Matt recommends fine-tuning hyperparameters such as model depth, regularization, class weights, number of estimators, and learning rate.
Ultimately, Matt offers listeners a simple formula: you should consider XGBoost as your first choice when you're working with a large quantity of tabular data, complete model interpretability is not essential, and you're not looking to minimize model execution time. Jon and Matt also get into the finer details of XGBoost, revealing the "secret sauce" behind it, and Matt recommends his favourite Python libraries for XGBoost-related tasks. These include pandas for data preprocessing, scikit-learn for data pipelining, Yellowbrick for visualizing model performance, and XGBFIR for gaining insight into feature interactions.
Despite its technical nature, this episode also sees Matt emphasize the critical role that communication plays in all data science roles and share his best practices for communicating XGBoost results to non-technical stakeholders. He recommends avoiding technical jargon and using vocabulary that is either valuable to the target audience or already familiar to them. From start to finish, this episode provides valuable insights and practical advice for data scientists looking to maximize the power of XGBoost for accurate classification models.
In this episode you will learn:
- Matt's book ‘Effective XGBoost’ [07:05]
- What is XGBoost [09:09]
- XGBoost's key model hyperparameters [19:01]
- XGBoost's secret sauce [29:57]
- When to use XGBoost [34:45]
- When not to use XGBoost [41:42]
- Matt’s recommended Python libraries [47:36]
- Matt's production tips [57:57]

Items mentioned in this podcast:
- Pathway
- Posit RStudio
- Anaconda
- MetaSnake
- SDS 557: Effective Pandas - 2022’s most popular episode
- XGBoost Library
- Effective XGBoost by Matt Harrison (20% discount for SDS podcast listeners already applied)
- HyperOpt library
- Pandas
- scikit-learn
- Yellowbrick
- XGBFIR
- CatBoost
- LightGBM
- Show Me the Numbers by Stephen Few
- Jon’s Podcast Page
Follow Matt:
Podcast Transcript
Jon Krohn: 00:00:00
This is episode number 681 with Matt Harrison, Many-Time Bestselling Author of Books on Python. Today's episode is brought to you by Pathway, the Reactive Data Processing framework, by Posit, the open-source data science company, and by Anaconda, the world's most popular Python distribution.
00:00:20
Welcome to the SuperDataScience podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I'm your host, Jon Krohn. Thanks for joining me today. And now let's make the complex simple.
00:00:50
Welcome back to the SuperDataScience podcast. I'm joined again today by Matt Harrison, who was my guest for 2022's most popular episode. Back then we focused on the Pandas library for working with tabular data in Python. Today he's back for an episode that's again focused on working with tabular data, but specifically on performing maximally powerful machine learning on those data with the XGBoost library. Matt is definitely the person you want to have teaching you about this technique. He is the author of seven best-selling books on Python and machine learning. His most recent book, Effective XGBoost, was published in March. Beyond being a prolific author, he teaches "Exploratory Data Analysis with Python" at Stanford University, and through his consultancy MetaSnake, he's taught Python at leading global organizations like NASA, Netflix, and Qualcomm. Prior to focusing on writing and education, he worked as a CTO and a software engineer, and he holds a degree in Computer Science from Stanford.
00:01:47
Today's episode will appeal primarily to practicing data scientists who are keen to learn about XGBoost, or keen to become an even deeper expert on XGBoost by learning about it from a world-leading educator on the library. In this episode, Matt details why XGBoost is the go-to library for attaining the highest accuracy when building a classification model. He talks about modeling situations where XGBoost should not be your first choice. He talks about the XGBoost hyperparameters to adjust to squeeze every bit of juice out of your tabular training data, as well as his recommended library for automating hyperparameter selection. He provides his top Python libraries for other XGBoost-related tasks such as data preprocessing, visualizing model performance, and model explainability. He also talks about languages beyond Python that have convenient wrappers for applying XGBoost, and best practices for communicating XGBoost results to non-technical stakeholders. All right, you ready for this super practical episode? Let's go.
00:02:49
Matt, welcome back to the SuperDataScience podcast. It's wonderful to have you here. I imagine as usual, you're calling in from Utah.
Matt Harrison: 00:02:57
I am, yeah. Thanks for letting me come back on.
Jon Krohn: 00:03:00
Yeah, my great pleasure. Yes, you were on the show last year in episode number 557, and you had the distinction, Matt, of having the most listened to episode in 2022. So thank you very much for contributing that.
Matt Harrison: 00:03:14
Yeah, I'll blame my robot hordes that I control.
Jon Krohn: 00:03:18
Well, it is true that you did have a very, very popular audio-only episode. I'm not sure you would have been number one on audio alone, but the thing that really took off was the video: typically only a few percent of our listeners view the video format, but for your episode it was 50/50. The YouTube version of your episode on Effective Pandas really took off. You had certainly been in the top 10 anyway, but that was what made you number one. So, I guess, yeah, you've got the YouTube robots worked out.
Matt Harrison: 00:04:02
Awesome.
Jon Krohn: 00:04:02
And since then, I actually saw you in person at ODSC West in San Francisco. I think that was early November of 2022. And yeah, really nice to see you in person there. There were a bunch of guests from the most popular episodes of SuperDataScience last year, guests like Sadie St. Lawrence and Ben Taylor, and I know we got a photo all together. And then that event led to me landing several other great guests that we've had already so far this year. We actually did an episode just a few weeks ago with Stefanie Molin, whom I met at ODSC West. It was episode number 675, and hers is also a Pandas episode, but it was kind of specifically focused on Pandas for data analysis.
Matt Harrison: 00:04:49
Awesome. Yeah.
Jon Krohn: 00:04:50
Yeah, so we know those Pandas episodes do well. So what have you been up to since the last episode, Matt?
Matt Harrison: 00:04:59
Yeah, mostly doing a lot of training. So I do corporate training, help people learn Python, and tell lies with data. I do live in Utah, so it's been snowing. It's been a record-breaking snow year, so getting in a little bit of skiing. And yeah in the meantime, I wrote another book as well, so.
Jon Krohn: 00:05:20
Yes, and that is our main reason for having you on today. So last year your episode was called Effective Pandas. Cause we were talking about your Effective Pandas book that had come out, and this time we're focused on your book Effective XGBoost - Tuning, Understanding and Deploying Classification Models. It just came out. Congrats on that, Matt. Huge accomplishment getting another one of these books out. And very kindly, you've also provided a promo code for our listeners, so I'll be sure to include that in the episode show notes as well as in the social media posts that I make about this episode. And furthermore, Matt has very kindly offered five free digital copies of his new book Effective XGBoost. So the way that that'll work, like the I don't know if contest is the right word, race. The races that we've had for free stuff that our guests have given away in the past, the way that this will work is, so these episodes come out on Tuesday mornings, New York time, these guest episodes. And so I'll make a post on my personal LinkedIn account. We post on several different accounts, but on my personal LinkedIn account, I will post that this episode is out and kind of highlight the key topics in the episode. And in that, I'll mention how the first five people to ask for a free digital copy of Matt's book, will get one. And so yeah, if you're listening to this just after it came out, you might still have a chance.
Matt Harrison: 00:07:01
Good luck.
Jon Krohn: 00:07:03
So you've written half a dozen books, not including second editions, and so you're a prolific author. Most of your content has been on Python, particularly, you've had several books on Pandas. Why was now the right time to write a book about XGBoost?
Matt Harrison: 00:07:24
Yeah, and I guess this goes back to, you know, with programmers there's often this notion of bikeshedding: when programmers do something, they want to do it their way. The story of bikeshedding is that they want to make a bike shed, and it has to be pink versus, like, the normal bike shed. And as someone who's an instructor, like I said, I spend a lot of time teaching this material to people, and I wanted a book that taught what I felt was important. A lot of books around XGBoost tend to be about here's the algorithm, and then here's how you make a model, and that's kind of the end of it. And I wanted more of: let's look at some exploratory data, let's go into making the model, then let's talk about various things that you can do with the model to evaluate it and tune it. But then also there are a few topics that I find people don't discuss when they're talking about XGBoost that I thought were interesting. And then also going beyond that, looking at things like deployment as well. And so it is going from sort of start to end, not just like, here's the algorithm, but what do you do after you have that model created as well.
Jon Krohn: 00:08:46
Nice. So, yeah, that sounds really great. It sounds like there was a big gap for you to address there. And particularly considering production deployment is so important today, there are so many resources out there that teach data scientists just how to create a model, and then you don't know how to actually use that in practice. So that is certainly a great gap. For our listeners who aren't aware of what XGBoost is, Matt, just fill us in generally on what this algorithm is and maybe how it's different from other kinds of models out there for classification problems.
Matt Harrison: 00:09:25
Sure. Yeah. So XGBoost stands for eXtreme Gradient Boosting, and it is a tree-based algorithm. So, generally when I'm teaching machine learning to folks, a lot of people start with logistic regression or linear regression, depending on if they're teaching classification or regression. I like to go at it from trees. So I will teach decision trees first, and I'll start off with the decision stump, which is basically an IF statement. It's saying: looking at all the columns in your data, and the label we're trying to predict, which value in which column will give you the cleanest split? And if you do that, you can make a tree out of that, which is basically an IF statement, but if it's just making one decision, we call it a stump.
00:10:19
And then you can recursively do that with your decision tree: look at each of those divisions, revisit the features, and make those splits, and then you get a decision tree. So a decision tree is great in that you can explain it very well. I mean, it basically will capture the information that is in your data, and you can let it overfit, and it is what is there. The problem is, like with a lot of things in machine learning, overfitting is a problem. So you need to be aware of that and handle that. And from there, I like to say, okay, well, how would you overcome some of these issues with decision trees? And one thing might be, well, you could make different decision trees and you could train them on different portions of the data.
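Matt's decision stump versus a fully grown, overfit tree can be sketched with scikit-learn (the dataset below is a synthetic stand-in, purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy tabular dataset: 200 rows, 4 feature columns, a binary label
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# A "stump" is a decision tree limited to a single split -- one IF statement
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# An unconstrained tree keeps splitting recursively until every leaf is pure,
# effectively memorizing the training data
deep_tree = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X, y)

print(stump.get_depth())      # 1: a single decision
print(deep_tree.get_depth())  # much deeper: recursive splits
print(deep_tree.score(X, y))  # perfect on its own training data -- overfitting
```

The perfect training score of the deep tree is exactly the overfitting symptom Matt describes: it captures what is in the data, including the noise.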
00:11:11
And that's sort of like this question: how many people do we add to the jury? Right? And so if you are looking at adding people to a jury and each of them sort of has a different background and point of view, well, if each of them has a greater than 51% chance of making the right verdict, it would be good to add all of these people. And that algorithm is basically a random forest. And then gradient boosting is saying, we're going to take a decision tree and it's going to have some error, and then we're going to correct for that error and make another tree that boosts or fixes that error. And we can keep doing that process. So the analogy I like to use is that these are sort of like golfing. If you were golfing, a decision tree is like having one chance to tee off, and hopefully it goes in the hole.
00:12:04
Random forest is like saying, you and 20 of your best friends all get to tee off, and wherever the balls land, we're going to average all those together, and that's where the ball is. Generally, I'm not a very good golfer, so that would probably be good for me. And then a gradient-boosted decision tree would be like, okay, you get to hit the ball, but then you get to hit again, and again, and you can hit it however many times you want. And each time you're correcting that error and getting it closer to the hole. And so I like to say that's sort of the basic idea behind these gradient-boosted decision trees: they can take a relatively simple or weak model and then keep correcting that error, and eventually they will do a decent job.
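The jury and golf analogies map onto two ensemble families in scikit-learn; a rough sketch (the dataset and model settings are illustrative, not from the episode):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random forest: many independent trees trained on random subsets ("jurors"),
# with their votes averaged together
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Gradient boosting: trees built sequentially, each correcting the residual
# error of the ensemble so far ("another swing at the ball")
boosted = GradientBoostingClassifier(n_estimators=100, random_state=0)
boosted.fit(X_train, y_train)

print(forest.score(X_test, y_test))
print(boosted.score(X_test, y_test))
```

XGBoost is a heavily optimized implementation of the second family, the gradient-boosting one.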
Jon Krohn: 00:13:26
Are you moving from batch to real-time? Pathway makes real-time machine learning and data processing simple. Run your pipeline in Python or SQL in the same manner as you would for batch processing. With Pathway, it will work as-is in streaming mode. Pathway will handle all of the data updates for you automatically. The free and source-available solution, based on a powerful Rust engine, ensures consistency at all times. Pathway makes it simple to enrich your data, create and process machine learning features, and draw conclusions quickly. All developers can access the enterprise-proven technology for free at pathway.com — check it out!
00:13:29
Yeah. So to kind of rewind and recap what you covered there: when we have a classification problem, when we're trying to use a machine learning or statistical model to predict a specific category, a common example, one that I've taught with, is this Titanic dataset that's really overused, where every row represents somebody that was on the Titanic and you have this binary label as to whether they survived the Titanic or not. So every row is a person, and other than the column that says whether they survived or not, you have all these other columns of information about the people. So like, were they in first class? Were they male? Were they female? How old were they? And so the decision tree will try to go, okay, which of these predictor variables is most predictive of whether somebody survived the Titanic or not?
00:14:31
And so it could be something like being female, which I think was a really good predictor. Yeah. So like, females were much more likely to survive. So the first branch in the tree is branching on gender, because if they were female, they were very likely to survive, and males were not. And then you can go from there and say, okay, within the females and separately within the males, what was the next biggest factor? Was it age? Was it fare class? And you continue down the decision tree like that. And so that gives you a single decision tree. But as you were saying, these decision trees can be highly prone to overfitting to specifically the training data, meaning that they don't generalize well to data that they haven't seen before.
00:15:14
And so a common approach, you mentioned the random forest idea of having multiple jurors. So you can have not just one decision tree, but as many decision trees as you like. So you might have ten or a hundred. And for each of those decision trees in the random forest, you end up with slightly different answers because you'll remove some of the cases in the data, or you'll remove some of the predictor variables randomly from each of the decision trees. So you end up with ten or a hundred or a thousand or whatever different decision trees that together form this random forest. And that can resolve the overfitting problem that we see with a single decision tree. However, with this XGBoost approach that you described of having multiple of these decision trees, instead of just randomly removing data or randomly removing predictor variables like we do with a random forest, with XGBoost you're specifically honing in on where does my decision tree make mistakes?
00:16:11
Your golf ball analogy was awesome, I've never heard that one before. Like this idea of having multiple swings to just get you closer to the perfect shot, closer to the perfect prediction with each subsequent tree in the boosted model. And so that concept ends up being really powerful. And so when you see these kinds of online competitions of what model gets you the highest accuracy on some task, depending on the kind of task, it might be a deep learning approach, if it's kind of like a machine vision problem or a natural language processing problem. But if it's working with tabular data, you can bet that XGBoost is going to win the competition.
Matt Harrison: 00:16:55
Yeah, yeah. It generally is up there. And as, a random aside, I found out recently that my wife's like second great-uncle was on the Titanic, and so-
Jon Krohn: 00:17:08
Oh really?
Matt Harrison: 00:17:08
Yeah, he died, but his wife survived. Right? And so that, that one, one of the key predictors in the Titanic, and that's kind of a morbid example, it's like predicting whether people will die if you keep sending a Titanic out and crashing it. But like, yeah, he, he died and his wife survived, right? So, so in that case, the gender was important for survival.
Jon Krohn: 00:17:30
Yeah, in the teaching that I do when I teach decision trees, based on the James Cameron film Titanic, there's like the two main characters of Jack and Rose. And so we create the decision tree and then I'm like, okay, so if we had a character with the attributes of Jack, what's his probability of survival? And it's like, it's very unlikely that Jack would survive, whereas Rose, based on her demographic factors, being in first class, being female, has a very high probability of surviving. And indeed that is what happens in the film.
Matt Harrison: 00:18:04
Yeah, no, it must be true.
Jon Krohn: 00:18:06
Spoiler alert. So yeah, so those are, yeah, two small anecdotes about the Titanic that demonstrate the power of decision trees. So yeah, but XGBoost these extreme gradient boosted trees, they really, I mean they're, I can see why it would be compelling to write a whole book about this because of how frequently we see these XGBoost models winning competitions.
Matt Harrison: 00:18:36
Yeah. Sup super powerful. And they tend to, out of the box, they tend to do a decent job, right? So, so with, with little bang, a lot of bang for, how would I say this?
Jon Krohn: 00:18:48
Minimal effort.
Matt Harrison: 00:18:49
Yeah, minimal effort. A lot of, a lot of juice for the squeeze.
Jon Krohn: 00:18:53
Right? Right. Exactly. And so then I guess the implication there is that the kind of the default hyperparameters work pretty well. Do you want to like, dig into some of the key model hyperparameters that we might want to work with when we're configuring an XGBoost model?
Matt Harrison: 00:19:08
Yeah. So, so you've got hyperparameters that deal with the tree structure, right? And so generally you want to make what people call weak trees. Weak trees are trees that don't go very deep. And then you want to have the subsequent trees correcting those issues. So you can, you can deal with how deep the trees go or you can deal with how many samples or rows would get split into a level and stop splitting once those get to a certain size. So that controls the tree structure. There's some regularization hyperparameters that you can control as well. So that would make it so that you pay less attention to certain columns and prevent overfitting that way. There are some hyperparameters that let you deal with model weights or classification weights. So if you've got imbalanced data, you can deal with that.
00:20:05
And then there's another hyperparameter that's common, which is the number of trees or estimators that you have. So again, that's how many times you hit the ball. And that's related to another hyperparameter, which is called the learning rate. So for the learning rate, if we go back to our golfing metaphor, if you've ever golfed and you try and hit the ball as hard as you can, sometimes that doesn't work very well. When I was taught to golf, it was like, hit the ball at 80%, so you're consistently hitting it. You're not squeezing as hard as you can, right? And if you're hitting the ball as hard as you can, you might overshoot the hole sometimes. And so this learning rate is saying, well, in the case of gradient boosting, if we can hit as many times as we want, we could take our putter out there and we could just keep putting the whole time, as long as we have more trees, and you'd eventually get to the end.
00:21:11
Now, there are pros and cons to that, right? You might have more trees, so when you need to make a prediction, that might take a little bit longer, but you can make a decent model. And so with a lot of these things in machine learning, there are trade-offs and balances. And so with evaluation techniques, you can figure out, you know, what combination of hyperparameters will give you a decent model. I've found that XGBoost, in my experiments, actually slightly overfits out of the box. And even considering that it does overfit out of the box, it tends to give better results than a lot of other models do. But with a little bit of tuning, we can get it to actually perform better than the out-of-the-box model.
Jon Krohn: 00:21:58
What kind of tuning would you do? So if we have this slight overfitting typically out of the box, then what are your key next steps to rein that in?
Matt Harrison: 00:22:09
Yeah, so it depends on what computers you have access to and how much time you have. In the book, I show an example of using the HyperOpt library. And the nice thing about the HyperOpt library is in contrast with, like, a grid search in scikit-learn. So scikit-learn is a popular Python library for doing tabular data machine learning, and scikit-learn has what's called a grid search. The idea there is, you have these hyperparameters that control how the model works, and then you can say, okay, these are the specific hyperparameter values that I want to evaluate. So for depth, you might say, maybe we want to look at a stump, right? So we let the depth go to one. Maybe we say the depth can go from like 1 to 10. Or maybe we also include unbounded in there as well.
00:23:07
And then, you know, if there's five hyperparameters and each of them has 10 different options, you've got this combinatoric explosion. Every time you add a new hyperparameter, you're multiplying the time that it takes to evaluate and see which combination is the best. And so you can use some other libraries. HyperOpt is one that I like to use, and the idea with HyperOpt is that rather than specifying here's the 10 different options, I provide a distribution. And it's going to do some Bayesian modeling to say, okay, how did my performance result when I chose this value from the distribution? And if it's okay, I'm going to exploit that area and try other values around that. But every once in a while it will do an exploration, where it will sort of jump back and try some other place that it might not have tried, to see if it might be stuck in a local minimum or something like that.
00:24:11
And so HyperOpt is a library where you can basically set these distributions, say how many times you want it to run, and then track the performance as it goes over time. That's one of the tools that I like to use. And then if you don't have a lot of time, in my book I show stepwise tuning. The idea there being that instead of doing this combinatoric explosion, we might sacrifice a bit, maybe we will end up in a local minimum, but we're going to say, let's just look at hyperparameters that deal with the tree first of all, and try and adjust the tree hyperparameters for a little bit. And since there aren't that many of them, it might not take very long. And then we'll look at regularization hyperparameters and just optimize those, rather than looking at the combination of both of those. And I found that, depending on your data size, that can save you a lot of time. You can get a decent model that is better than your out-of-the-box model with relatively little effort doing that.
Jon Krohn: 00:25:15
Super cool. Those are great tips. I love this idea of using the HyperOpt library instead of doing a grid search over, as you say, this very rapidly expanding number of possible hyperparameter combinations, using the Bayesian model with probability distributions to hone in on the best hyperparameter values. And it's cool that it'll use that probability distribution to bounce out a bit from where it's already found to be the right spot, to explore a little bit more. That's really cool. In deep learning, which I have a lot more experience with than XGBoost, we have variable learning rates. So when you gave that analogy earlier, Matt, about the different types of golf clubs, I love that, I hadn't had that visualization before.
00:26:16
So this idea of learning rate being like having a putter for a very small learning rate, or a driver for a very large learning rate. And the key thing there is that when you're using the driver, you have to be taking a full swing. Like you're cranking it, you're trying to hit it like 300 yards every swing with a driver. So even when you're on the green, very close to the pin, you're swinging this driver and hitting it 300 yards and hoping that you're going to get it in the hole, which is obviously going to be hard. So the analogy works really well, where if you have the time to wait while using the putter, you know, that's going to be inefficient as you start from the tee, but once you make your way to the hole, you're going to find it easy to sink it. Whereas with the driver, you're going to get to the hole quickly, but then it's going to be really hard to get a very precise model.
00:27:18
So in deep learning, we have this idea of variable learning rates, which ties in perfectly to your golf analogy, where you start with a driver, and then as you get closer to the hole, you switch to irons and then you switch to a putter. So you're gradually reducing your learning rate as your model gets closer to having the right answer. And so, you know, this was just a big exploratory tangent and I have no idea: is there a concept of this same kind of variable learning rate with XGBoost?
Matt Harrison: 00:27:51
So I don't believe XGBoost supports variable learning rates, but I think, Jon, one of the key things to recognize is that unlike random forest, where each of those trees is trying to, like, hit the ball in the hole, what a successive tree in gradient boosting is doing is trying to correct the error, right? So it's not repeating the same hit, it's correcting that error. So it is kind of like we can choose whatever club we want for that first hit, and then we're going to use different clubs, because we're looking at our data from a different point of view at this point.
Jon Krohn: 00:28:31
Nice. That's a really great answer. So yeah, the variable learning rate doesn't matter as much. It's really just, for that first shot off the tee, what are we going to whack it with to get going?
Matt Harrison: 00:28:42
Yeah, I mean, you can adjust the learning rate, as I show in my book. And basically what that's doing is, you know, if you say the learning rate is 0.5, however hard you hit off the tee, you're going to hit half as far, basically, right? And so yeah, that is something that you can tweak. And it might be the case that if you keep that cranked up, you might be doing some ping-ponging there, versus if it's slowed down a little bit, it might take a little bit longer, but get to the hole, or give you a better result in the end.
Jon Krohn: 00:29:18
Nice. This episode is brought to you by Posit: the open-source data science company. Posit makes the best tools for data scientists who love open source. Period. No matter which language they prefer. Posit's popular RStudio IDE and enterprise products, like Posit Workbench, Connect, and Package Manager, help individuals, teams, and organizations scale R & Python development easily and securely. Produce higher-quality analysis faster with great data science tools. Visit Posit.co—that’s P-O-S-I-T dot co—to learn more.
00:29:56
So is it just this capacity to be taking the errors, taking the residuals, and refining on those? Is that fundamentally what makes XGBoost work so well? I said "XGBoost" really funny there. Is that fundamentally it, or are there other ingredients in the secret sauce?
Matt Harrison: 00:30:25
Yeah. So we do have, you know, that correction of the residuals. And generally what folks are doing is making what they call a weak model, a tree that doesn't go very deep, so it's not overfitting, and then correcting the error of that. But XGBoost has some other benefits as well. Like I said, it does have the ability to regularize, so there are various hyperparameters that we can tune for regularization. It also has the ability to work with missing data and categorical values, which might not seem super important. But a lot of us have data that's kind of messy going in, and I'm not saying that you should skip pre-processing or cleaning up your data, but that can help eliminate a lot of those questions, like: if I have missing data, what do I do with it? Well, XGBoost can just take care of that for you.
00:31:21
You can imagine that when it's making a decision, it can just say, okay, let's take into account whether the value is missing as well. And then another thing XGBoost has going for it is that it's optimized to leverage multiple cores. So if you've got a beefy machine, it can run very quickly. And you might think, how does it optimize for multiple cores when the trees have to be built one after another? Well, basically it can evaluate multiple splits in the same tree at the same time. So you get through a single tree a lot faster than you would otherwise.
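To make the missing-data point concrete, here is a toy, hard-coded decision node in plain Python; the probabilities and threshold are invented. Real XGBoost learns a "default direction" for missing values at every split, but the sketch shows the shape of the idea: a missing value gets routed down its own branch instead of forcing you to impute first.

```python
# Toy decision node that handles missing values (None) directly,
# instead of requiring imputation beforehand. Real XGBoost learns
# which branch missings should follow; here we hard-code one.

def split_age(age):
    """Predict a survival probability from age, routing missings explicitly."""
    if age is None:          # missing value gets its own branch
        return 0.35
    if age < 15:             # children more likely to survive
        return 0.60
    return 0.30

rows = [4, None, 40, 12, None]   # ages, some unknown
preds = [split_age(a) for a in rows]
```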
Jon Krohn: 00:32:00
Nice.
Matt Harrison: 00:32:01
So the combination of those gives us like we've said, an algorithm for tabular data that works really well out of the box, doesn't require a lot of pre-processing, though obviously with pre-processing generally your model will improve. And so I think it's a great tool to have in your tool belt and to be aware of, again, because it is a lot of bang for the buck.
Jon Krohn: 00:32:27
Nice. Super cool. Yeah. So in addition to refining on residuals, other reasons why XGBoost works so well relative to other algorithms for classifying tabular data include that it works well with missing data, and I can't overstate how important that is. Because if you have 20 possible columns of predictor variables that the decision tree could be considering at any given point, it might be the case, going back to the Titanic example again, that there are rows where we knew the age of the passenger but we didn't know their fare class, or vice versa. And so it could end up being the case that, over the hundreds or thousands of people that were on the Titanic, there's one data point missing for half of them, or maybe for three-quarters of them.
00:33:20
And so if you're not able to handle these missing data points, you need to be pre-processing them, like you say, like imputing what these missing values could be. So it's really cool that XGBoost can handle missing data right out of the box. Regularization is also obviously huge, and probably another reason why XGBoost does so well, because XGBoost isn't just overfitting to training data; it does well on data that it hasn't seen before. In these competitions, that's basically the way the format always works: they're like, "Hey, here's your training dataset. We're going to evaluate you on a dataset that you'll never be able to see." And so that regularization is key. And obviously in the real world, not just in competitions, machine learning models are dealing with data that they haven't seen before, so they need to regularize well in order to generalize well. And it's super cool that XGBoost is optimized to leverage multiple cores. So if you have access to a lot of compute, or you use a cloud instance to access a lot of compute, you can crunch through a lot of data as quickly as you want; you can just scale up your compute to get an answer more quickly.
Matt Harrison: 00:34:27
Exactly. Yeah. It will also run on GPUs as well, so, so-
Jon Krohn: 00:34:32
Oh, really?
Matt Harrison: 00:34:33
Yeah. So, so if you, if you've got beefy GPUs as well, you can leverage XGBoost there as well.
Jon Krohn: 00:34:40
Nice. I did not know that. Yeah, that is super cool. So what are particular circumstances when you would use XGBoost? We already talked about tabular data, so it sounds like basically any time you have tabular data would be a time to use it. Are there other circumstances as well, or is that the key time?
Matt Harrison: 00:35:00
Well, tabular data is sort of key, but let me maybe go into some nuance. I mean, in the Python world, you've got a bunch of options; SciKit-Learn implements a bunch of algorithms that you could use. And I will note that there are other extreme gradient boosting libraries as well, like CatBoost or LightGBM, which from a 20,000-foot point of view are very similar to what XGBoost is doing. But one of the reasons why you might want to consider one of these extreme gradient boosted decision tree models is if you have complex relationships in your data. A nice thing about decision trees is that they can capture non-linear data.
00:35:49
An example I like to use: for a while I worked helping create solid-state storage devices, so NAND flash, SSD-type devices. I'm not a hardware person, but I found out that there's this oddity where a piece of NAND flash has a tendency to die during infancy. Then if you get it past, like, the teenage years, it will work fine, and then the electrons get stuck again at the end. So there's this bathtub curve: at the front there's a chance that it might not work, but once you get it past a certain point, it's basically going to work until it gets to end of life. And so, you know, if you're using something like a logistic regression model, it can't capture that non-linearity if you're looking at the age of the device, right? It can just say, as the age of the device goes up, either I'm more prone to failure or less prone to failure. Given that single column, logistic regression wouldn't be able to make a good prediction from it. Whereas a decision tree can say, okay, I'm first going to look at the age, and if it's less than some amount, then I'm going to say failure is more likely; and after it's done that split, it can look at the age again and capture those non-linearities. So I think that's super useful. This is a feature of decision trees in general, so random forests and decision trees have that too.
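Matt's bathtub curve can be illustrated with two toy models in plain Python (all numbers invented): a tree-style function that splits on age twice, and a linear function that, like a single logistic regression coefficient, can only move monotonically with age.

```python
# The "bathtub curve": failure risk is high for very young devices,
# low in mid-life, and high again near end of life. A linear model in
# age is monotone, so it cannot capture this; a depth-2 tree can.

def tree_failure_risk(age_pct):
    """Toy depth-2 tree over device age (0-100% of rated life)."""
    if age_pct < 10:         # infant mortality
        return 0.40
    elif age_pct < 85:       # healthy mid-life
        return 0.05
    else:                    # wear-out at end of life
        return 0.50

def linear_failure_risk(age_pct, slope=0.004, intercept=0.05):
    """A linear model's risk can only go up (or only down) with age."""
    return intercept + slope * age_pct

ages = [2, 50, 95]                                # young, mid-life, old
tree_preds = [tree_failure_risk(a) for a in ages]
linear_preds = [linear_failure_risk(a) for a in ages]
# The tree recovers the U-shape; the linear model is forced monotone.
```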
00:37:22
Another thing that you can get by using a library like XGBoost is interactions. Going back to our Titanic model: okay, are you female? Well, there might be a difference between a female in first class and a female in third class, right? So we can split on gender, but then we can also split on class after that. And if we see in our trees that we often split on gender and then class after that, that would tell us that there might be a relationship between those two. And again, this is something that logistic regression and some of these other models can't really capture, relationships or interactions like that, unless you start encoding them into the data, right? Make a new column that takes these two columns and multiplies them together. Or, for the example of our U-shaped bathtub curve, make a new indicator column for whether the age is less than the teenage years, and another indicator column for whether the age is greater than old age, that sort of thing.
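A minimal sketch of the hand-engineered interaction column Matt describes, using invented Titanic-style rows; a tree model would pick up this gender-by-class effect on its own by splitting on one feature and then the other, but a linear model only sees it if you build the crossed column yourself:

```python
# A tree captures interactions by splitting on gender, then class.
# A linear/logistic model only sees the interaction if you hand-engineer
# a crossed column. Toy rows (field names are illustrative):

rows = [
    {"female": 1, "first_class": 1},
    {"female": 1, "first_class": 0},
    {"female": 0, "first_class": 1},
    {"female": 0, "first_class": 0},
]

# Hand-engineered interaction column for a linear model:
for row in rows:
    row["female_x_first"] = row["female"] * row["first_class"]

# Only the (female AND first class) row carries the interaction signal.
flags = [r["female_x_first"] for r in rows]
```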
00:38:27
So, as a data person, I find that super interesting: there are these relationships in the data, and these algorithms can find them for us and help us uncover them, even [inaudible 00:38:44], we can look at the results and start understanding our data better from that. Again, XGBoost is a relatively performant model, and so if you want good results, and there are some trade-offs that you're willing to make, XGBoost might be the model to use.
00:39:08
And then going back to, you know, what's in the data? I do kind of like to use XGBoost to go back and understand what's going on with my data. So I would suggest using XGBoost as a mechanism to do EDA, or Exploratory Data Analysis. If you think of this machine learning life cycle, where we ask a question, make a model, and look at the results, it might have a cycle in it, or multiple cycles. And oftentimes I've found that after I evaluate my XGBoost model, I have insights into the data that I didn't have initially, and those might help me make a better model after that.
Jon Krohn: 00:39:56
Yeah, really great points there. So just to kind of recap them: another one of the things that makes XGBoost so powerful is that it automatically handles non-linearities and interactions. And that's actually similar to why deep learning can outperform regression models in other kinds of cases. So XGBoost is, again, great for tabular data, whereas with deep learning you might say, you know, if you're working with image data or video data or natural language data, deep learning is the way to go first. But with either approach, XGBoost or deep learning, what makes them so powerful is that they handle these non-linearities, these interactions between input features, fully automatically. So, super powerful.
00:40:39
And then to distinguish between deep learning and XGBoost, another reason beyond just the tabular versus non-tabular data situation is that XGBoost can be way more performant, not just on the training data, but actually in production as well. So when people are using ChatGPT today, they see words coming out kind of one by one, or characters coming out individually. That is an extreme example of a model with a lot of weights, and given how many weights are involved, it's pretty remarkable that you can get answers in real time. But any XGBoost model in production that I've ever seen works instantaneously. There's no lag, even on, you know, relatively inexpensive CPU hardware running on the server.
Matt Harrison: 00:41:37
Yeah, yeah, those are good points.
Jon Krohn: 00:41:39
So you've given us lots of great reasons for situations that we should be using XGBoost. I've touched on a situation where you might not want to be like, so working with image data, for example, natural language data. What other kinds of circumstances would XGBoost not be the ideal choice?
Matt Harrison: 00:41:59
Yeah, obviously non-tabular data, like you mentioned: XGBoost isn't going to work with that unless you pre-process your data, and as you mentioned, deep learning models offer state-of-the-art performance and are what people are using these days for image, video, audio data, that sort of thing. Going maybe the opposite direction, if you have really small data, right, maybe you only have tens or a few hundred rows of data, you might want to go with a simpler model, maybe something like logistic regression. Another key thing that might impact that decision, and this is probably one of the biggest influencers, is this notion of interpretability. Can you interpret the model?
Jon Krohn: 00:42:53
I knew that was coming.
Matt Harrison: 00:42:55
Yeah. And so I, I've done some work in the finance sector, and a simple example is if we're making a model to predict whether we should give someone a loan or not, right? And you could imagine if you're a customer going into a bank and you're applying for a loan and they say, "no, we're not giving you a loan", and you ask, "well, why not?" And they say "because the model said so." That's not going to be super satisfying for the customer. And so you could imagine that if you are in that situation where you want to be able to explain what went on, and if you were able to say, well, if you had $5,000 in an account and your credit score was 10 points higher, we would definitely give you a loan. That would be, okay, I didn't get a loan, but I know what I need to do to get one.
00:43:48
And so if you had a model that was able to give you those answers, even though maybe it wasn't quite as accurate at catching fraud, but it made it so you didn't make all the customers mad and have them leave you for a different bank, that might be a trade-off that you're willing to make. So, that notion of interpretability: a lot of people would call XGBoost a black-box model, meaning that it's impossible to understand. I would say it's a dark gray model. I mean, if you're motivated, you can look at all the trees, you can print them out, and you can walk through them, right? You can understand what's going on there, but that's not something that you're going to do with a loan officer. They're not going to sit down with you and say, "well, let's look at trees 1 through 582", right? That's just not going to work in a business context.
00:44:44
So, you know, a strict white-box model that is completely interpretable might be a reason to choose a different model. Now, there are some libraries that let us [inaudible 00:44:57] those, and we'll probably discuss those later. And then I guess one issue, you know, if you do have thousands of trees, there could be a speed issue. If you need super quick results, it's going to be hard to beat something like logistic regression, because there you're basically just taking the inputs, multiplying them by weights, and summing them, and that's your answer. So it's going to be hard to beat if you need super quick inference.
Jon Krohn: 00:46:18
Did you know that Anaconda is the world’s most popular platform for developing and deploying secure Python solutions faster? Anaconda’s solutions enable practitioners and institutions around the world to securely harness the power of open source. And their cloud platform is a place where you can learn and share within the Python community. Master your Python skills with on-demand courses, cloud-hosted notebooks, webinars and so much more! See why over 35 million users trust Anaconda by heading to superdatascience.com/anaconda — you’ll find the page pre-populated with our special code "SDS" so you’ll get your first 30 days free. Yep, that's 30 days of free Python training at superdatascience.com/anaconda.
00:46:20
Nice, yeah, that's a really good point. I guess I haven't used that many trees in XGBoost, and I've never deployed it into production, so I'm not the expert; I hadn't thought about how, if you have lots of trees, you're going to underperform relative to a regression model. So you mentioned finance applications; there you were talking about loans, where it's critical to have interpretability. But another kind of finance case is high-frequency trading, where milliseconds, or even fractions of milliseconds, matter in getting a trade ahead of somebody else. In that scenario, even just a little bit of lag associated with XGBoost might be too much.
Matt Harrison: 00:46:59
Yeah. And again, there might be a trade-off there. It's like we might not be able to do the trade if it's slow. And so we're willing to take a model that maybe makes slightly worse trades because otherwise, we wouldn't even be able to be in the game sort of thing.
Jon Krohn: 00:47:16
Right. Totally. So, cool. Great rundown of the kinds of situations when we might not want to use XGBoost: obviously non-tabular data, a very small amount of data, if interpretability is essential, or, yeah, if it's a really high-speed, performance-critical situation. So you mentioned there are other kinds of Python libraries that might solve some of these problems. Let's dig into that.
Matt Harrison: 00:47:41
Yeah, so I mentioned that you can use XGBoost with minimal pre-processing. I don't generally recommend doing that; I would recommend spending some time with your data before making your models. And I've found that Pandas is a great library for doing that, and we talked about that in our other episode, but diving into your data and understanding what's going on there is probably going to give you some insights, before you start making your model, that might come in useful. Another library that's super useful is the SciKit-Learn library. And even though XGBoost is not part of SciKit-Learn, XGBoost has compatibility with SciKit-Learn, so if you are using SciKit-Learn pipelines to do some pre-processing, you can plug XGBoost in there. I will note that SciKit-Learn, after XGBoost came out, did implement their HistGradientBoosting models, which are basically a SciKit-Learn version of XGBoost and are going to have similar characteristics.
Jon Krohn: 00:48:49
So I remember from your Effective Pandas episode number 557, that one of your number one tips for data scientists is to be chaining methods. So does this mean that we can chain SciKit-Learn methods right in there with XGBoost?
Matt Harrison: 00:49:07
So SciKit-Learn has this notion of a pipeline, right, which is basically chaining operations. I will take my Pandas pre-processing code, and if you've chained that, you can, with relative ease, write a class in Python that acts as your Pandas transformer, which you can stick in the pipeline. So you stick in that Pandas transformer as the first step. Then you might have some SciKit-Learn steps that are doing, like, standardization or something like that. And then you can stick the XGBoost model at the end of that and have the data flow through the whole pipeline.
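The pattern Matt describes, a pandas-style transformer first, other transformations in the middle, and an XGBoost model at the end, is what sklearn.pipeline.Pipeline gives you. As a stdlib-only sketch (all classes and numbers here are invented, with a trivial threshold model standing in for the XGBoost estimator), the chaining idea looks like this:

```python
# A scikit-learn-style pipeline in miniature: each step transforms the
# data, and the final step is a model. In real code you'd use
# sklearn.pipeline.Pipeline with your pandas transformer, a scaler, and
# xgboost.XGBClassifier at the end; this only shows the chaining idea.

class FillMissing:
    """Stand-in for a pandas pre-processing transformer."""
    def transform(self, xs):
        return [0.0 if x is None else x for x in xs]

class Scale:
    """Stand-in for a standardization step."""
    def __init__(self, factor):
        self.factor = factor
    def transform(self, xs):
        return [x * self.factor for x in xs]

class ThresholdModel:
    """Stand-in for the final estimator (e.g. an XGBoost classifier)."""
    def predict(self, xs):
        return [1 if x > 0.5 else 0 for x in xs]

class MiniPipeline:
    def __init__(self, steps, model):
        self.steps = steps
        self.model = model
    def predict(self, xs):
        for step in self.steps:          # run each transformer in order
            xs = step.transform(xs)
        return self.model.predict(xs)    # then the model makes the call

pipe = MiniPipeline([FillMissing(), Scale(0.1)], ThresholdModel())
preds = pipe.predict([3, None, 12])
```

The payoff Matt mentions is that the whole chain, pre-processing included, travels with the model, so redeploying it doesn't mean reconstructing 50 notebook cells.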
Jon Krohn: 00:49:49
Nice. And when you say transformer there, in a lot of recent episodes on the show, we've been talking about transformer architectures, like the GPT series, but this is just a transformer like with a lowercase "t" just meaning something that transforms something.
Matt Harrison: 00:50:02
Yeah, the SciKit-Learn notion of transformer, not the deep learning notion. In SciKit-Learn, a transformer is something that takes some data in, transforms it, and gives you data out, generally of the same shape or dimension. Some other libraries that I think are useful for people who are dealing with XGBoost: the Yellowbrick library. Yellowbrick is a visualization library, and I'm a huge fan of visualization; I find that using intelligent visualization can help me understand how my model's performing and actually evaluate my models. SciKit-Learn has some visualization capabilities; Yellowbrick is a little bit more advanced there. Another-
Jon Krohn: 00:50:54
Nice, that is a new one for me.
Matt Harrison: 00:50:55
Yeah, Yellowbrick. Yeah. Another one that a lot of people haven't heard of is a library called Xgbfir. This is a library for looking at feature interactions. And again, this is that notion that, in a decision tree, if you look at a feature or a column and then immediately look at another feature after it, and you see that those pairs of features keep following each other in the trees, that might indicate that there's a relationship between those two. So this is a library that basically takes the output of XGBoost and reports on potential interactions in your data. Again, this is one of those cool things you can use to kind of do exploratory data analysis after you've created a model, because you can go back and say, oh, these two columns, like gender and class in the Titanic, are related. I may have thought of that, but I might not have. But after I've made my model, the data is proving that out, and the library is reporting it to me.
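The core counting idea behind Xgbfir can be sketched in plain Python. Note that the real library works from a trained booster's split and gain statistics; the tree paths below are invented purely for illustration:

```python
# Miniature version of the Xgbfir idea: walk the split paths of the
# trees and count which features appear consecutively. Pairs that keep
# showing up together hint at an interaction. (Real Xgbfir scores pairs
# using gain statistics from the trained booster; these paths are made up.)

from collections import Counter

tree_paths = [
    ["gender", "class", "age"],
    ["gender", "class"],
    ["age", "fare"],
    ["gender", "class", "fare"],
]

pair_counts = Counter()
for path in tree_paths:
    for parent, child in zip(path, path[1:]):  # consecutive split pairs
        pair_counts[(parent, child)] += 1

# The most frequent parent->child pair suggests the strongest interaction.
top_pair, top_count = pair_counts.most_common(1)[0]
```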
Jon Krohn: 00:52:12
Awesome. Those are some really cool Python libraries for using along with the XGBoost library. So Pandas is for pre-processing, SciKit-Learn for maybe also having some pre-processing steps in a pipeline, and Yellowbrick for visualizing the model and model performance, and then Xgbfir for giving us a bit of explainability into this dark gray box. Cool. So everything that we've talked about in this episode so far has been in the Python language. Is it possible to implement XGBoost in other languages as well?
Matt Harrison: 00:52:52
Yeah, so XGBoost is not actually implemented in Python; it's implemented in C++. And then, you know, this is a sort of dirty secret of things that work fast in Python: Python is a slow language, but it makes for good glue. If we have things that are a little bit snappier and we have a Python wrapper for them, that kind of gives us the best of both worlds. So there are wrappers for XGBoost in R, Java, even Ruby or Swift if you wanted, or you could call XGBoost from C or C++ as well. So you've kind of got your bases covered with what I would say are the most popular data science languages.
Jon Krohn: 00:53:46
Certainly. Yeah. Super cool. And then, so if there's a listener out there who is hearing about XGBoost for the first time or getting excited about it as a result of listening to this episode, more so than before, and they want to try it out, what are the kinds of prerequisites that they need before trying XGBoost, you know, for example, by grabbing your book?
Matt Harrison: 00:54:07
Yeah. I mean, understanding what you would use XGBoost for and where it's appropriate. Generally this is for supervised learning problems, right, where we're trying to predict a label in the case of classification, or predict a numeric target in the case of regression. And I find myself teaching XGBoost to a lot of people who are subject matter experts, right? They have data, they're subject matter experts about the data, but then they need to make predictions about that data. So the more you understand the problem domain, the better you're going to be able to make a model. It's often said, and I wish I had someone to attribute this to, that if you have better data, a simpler, "dumber" model will often outperform a super fancy model. And so certainly you could think of cases where, if you slightly improved your data, you could throw it into logistic regression and probably do as well as you could with a model like XGBoost, for certain situations, right?
00:55:31
For other situations, that might take a lot of time or effort or pre-processing, such that you're doing so much processing on the data that you might as well just use something like XGBoost, because it's kind of going to do that for you instead. And then there is some minimal level of programming that you're going to have to have. I mean, making a model using SciKit-Learn or XGBoost is not particularly hard; it's like three lines of code. So if you can get your environment bootstrapped and up and running, or you have some web-based or hosted version of the environment that you can use, it's not particularly hard. But again, there's a lot of work that you do before making the model and after making the model, such that those three lines of code aren't really the important part, because of all that other stuff that is required.
Jon Krohn: 00:56:36
So just like calling a regression model in SciKit-Learn, calling an XGBoost model is straightforward. It's just a few lines of code.
Matt Harrison: 00:56:44
It's actually the same interface. Because XGBoost implements the SciKit-Learn interface, one of the things you can do is take, you know, half a dozen different families of models in SciKit-Learn and XGBoost, and if your data is sufficiently pre-processed (again, XGBoost supports things like missing values, and a lot of SciKit-Learn models don't), so that it's in the common subset that works with those other models, you can make a for loop and just loop over all these different models, evaluate them very quickly, and see the general performance characteristics of the different models.
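That loop-over-models idea relies only on duck typing against the shared fit/predict interface. In real code the list would hold things like LogisticRegression(), RandomForestClassifier(), and xgboost.XGBClassifier(); this stdlib-only sketch uses two invented toy models and made-up data so the pattern is visible end to end:

```python
# Because XGBoost's Python API implements the scikit-learn interface
# (fit / predict), you can loop over several model families and compare
# them with identical code. Two toy stand-ins demonstrate the pattern.

class MajorityClass:
    """Baseline: always predict the most common training label."""
    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self
    def predict(self, X):
        return [self.label] * len(X)

class FirstFeatureThreshold:
    """Toy 'real' model: threshold on the first feature."""
    def fit(self, X, y):
        return self
    def predict(self, X):
        return [1 if row[0] > 0.5 else 0 for row in X]

X = [[0.9], [0.1], [0.8], [0.7]]
y = [1, 0, 1, 1]

scores = {}
for model in (MajorityClass(), FirstFeatureThreshold()):
    preds = model.fit(X, y).predict(X)
    accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
    scores[type(model).__name__] = accuracy
```

Swapping in real estimators only changes the contents of the tuple; the loop body stays the same, which is exactly the convenience of the shared interface.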
Jon Krohn: 00:57:24
Very nice. So yeah, the programming skills come in. They're mostly required for being able to pre-process the data appropriately: understand the data before putting it into the model, do the kind of exploratory data analysis, make sure all the features you're putting in make sense, and, you know, that you're not missing too much data, that kind of thing. And then of course afterwards, when you are interpreting the results, figuring out how to visualize them, telling a story around them, and then, of course, putting the model into production. Do you have any particular production tips for us that you can share quickly in a podcast format?
Matt Harrison: 00:58:02
Yeah, I mean, in production you're going to want to use monitoring if you can. So monitor performance over time to make sure that you don't have drift in your data. I would say another production tip is that if you leverage something like a SciKit-Learn pipeline, that's going to make your life a lot easier. So, you know, this kind of goes back to our discussion that we had last year around chaining, where I said that chaining is this practice that, if you use it in Pandas, is going to make your life easier, because you can go back and update your code, and it's going to read more like a recipe. A similar thing applies if you're using chaining in Pandas combined with SciKit-Learn pipelines.
00:58:59
It certainly is possible to make models and deploy them without using those sorts of things. But if the model logic and everything that happened is in 50 different cells in someone's notebook somewhere, trying to recreate that or redeploy the model is going to be a pain. So putting some MLOps around that, and by MLOps I mean using SciKit-Learn pipelines, but also doing some monitoring, testing your code, and being able to deploy it quickly, can make your life a lot easier.
Jon Krohn: 00:59:38
Nice. So that kind of gives us some great tips on putting these models into production. And then what if in a use case where we're not necessarily putting something into production, but we're running this model to get some insights into the data or to be able to make some predictions in a non-production environment? So we might need to communicate results to non-technical stakeholders, to business people. How, do you have any tips for how we can be doing that effectively?
Matt Harrison: 01:00:10
Yeah. When I talk about data science, I often say that to me the most important skill of a data scientist is communication, right? Because a lot of people have this notion that a data scientist sort of sits in their ivory tower and then every so often comes down and says, "you need to do this", right? And they don't listen to the business. So you do want to have these open channels of communication, where the business feels like you're listening to them and you understand them. I would highly encourage starting that sooner rather than later. So: good communication, making sure you understand what the problem is. And part of that is also, what are the correct metrics? How are you going to evaluate your model, and what are the metrics around that? When you're explaining a model, if you can explain it in terms of dollars, that might be a little bit more compelling than explaining it in terms of precision or recall, terms that tend to be jargon for a lot of people. If you can bring it to their level, that can be important.
Jon Krohn: 01:01:34
So, like savings or profitability. Like if we implement this model relative to our existing approach, we will save this much money or we'll be this much more profitable.
Matt Harrison: 01:01:42
Yeah, yeah. Having a recommendation, you know, that has dollars attached to it, rather than a lot of jargony speak, can be useful. On a similar note, we did say that we can explain logistic regression, and so if you can explain a model, that might be useful. A lot of people are using libraries like SHAP to explain XGBoost models, which I think is awesome, and again, I can use SHAP to further do exploratory data analysis, that sort of thing. But I wouldn't just hand over one of the plots that comes out of SHAP directly to some business stakeholder and expect them to understand it, right? So you need to be careful about what you're handing over, or, you know, walk people through it if it's necessary to dive into those details.
Jon Krohn: 01:02:35
Nice. Great tips for being a data scientist in general there, Matt, as well as for explaining XGBoost in particular. I suppose with XGBoost, using these kinds of dollar terms to make the point could be particularly important, given that you might not want to have to get into the, like, "well, there are all these trees, we've got like 582 trees, and maybe we don't have the best interpretability, but we do have performance", and, yeah, so.
Matt Harrison: 01:03:11
Yeah, I mean, if you can explain it like, we'll save a million dollars if we use the XGBoost model, and we'll save half a million dollars if we use the logistic regression model, you know, something like that can help grease the skids on model deployment.
Jon Krohn: 01:03:26
Nice. And then, we've spent this whole episode learning about XGBoost, and now how to communicate the results of an XGBoost model effectively and put an XGBoost model into production. But Matt, can't ChatGPT just do all this for us?
Matt Harrison: 01:03:43
Yeah, great question. It couldn't yesterday. It might tomorrow. I mean, it is changing pretty quickly. I'll say this: I use a calculator, I use a spell checker, so I use tools that make me more productive, and I've found that using tools like ChatGPT and other AI tools that help me code can make me more productive. I don't think that ChatGPT has a lot of the nuance that a subject matter expert would have on the data. And so while ChatGPT can sort of speak in generalities about how to apply XGBoost, which might give you a decent model out of the box, again, getting an out-of-the-box model is not hard. It's three lines of code, right? So can ChatGPT get you there? Yeah, it can probably get you there. But I think you really want to start going above and beyond that, especially when you want to start communicating these things to others.
01:04:53
And so: lack of subject matter expertise, and often lack of critical thinking skills. You know, people ask me, is ChatGPT going to take over my job? And the common answer that I'm liking, that I'm seeing, is: no, but someone who's using it will if you're not using it, right? So I don't want to have my head in the sand, saying that I'm not going to use this for whatever reason. I think you should be aware of what it can do, the pros and cons of it. I've found that it helps augment my productivity, and so I would use it where it does that. But I would also say that bosses want a throat to choke, so to speak, or someone to be able to explain a model to them, to say, what's going on here? I don't really see someone saying, okay, we're going to replace our data scientists with ChatGPT, and the CEO's just going to ask ChatGPT to write some Python code for them. I don't see that happening. So, long story short, I'm not super afraid of my job, or the jobs of the people I'm training, going away per se, but I do think this will augment their capabilities, and a lot of them are going to have to up their game.
Jon Krohn: 01:06:21
Exactly. Yeah. It's a super useful tool: instead of looking up in the docs, you know, how do I instantiate an XGBoost model? You can just ask ChatGPT to do it. And then you can even ask it helpful things, the same kinds of questions I was asking you today: what kinds of production considerations might I want to have? And you can make it specific to your circumstances: given that my machine learning model is doing this and it's going to be used in this way, what are the kinds of things I need to be concerned about? And it might have some great ideas. But ultimately, it's going to come down to a human making the decision and deciding exactly what to implement, what amount of risk is appropriate for this particular model in production, for example. And yeah, so I agree with your assessment that ChatGPT won't take your job, but somebody using it could.
01:07:16
Awesome. Matt, this has been a fantastic episode focused very specifically on XGBoost. I love having these kinds of episodes, and based on how well your Effective Pandas episode did last year, I have a feeling this one will be in the running for a top 10 in 2023. Fingers crossed. So beyond your book Effective XGBoost, do you have a book recommendation for our listeners? We always ask for one at the end of the episode.
Matt Harrison: 01:07:45
Yeah, I mean, a good book that I like is called Show Me the Numbers, and it is a visualization book. It's actually light on code, there isn't any code in it, so it's language agnostic, but it's got best practices and tips on making effective visualizations. Again, I think especially for these black-box models, visualizations can tell a great story, and I find that a lot of people have very cursory visualization skills. If they can bump those up a little bit, they can tell a lot better story.
Jon Krohn: 01:08:28
Nice. Yeah. Great tip. And Matt, you clearly are a deep expert on not just XGBoost, but Pandas, lots of other Python libraries, machine learning in general. I'm sure I'm even missing some of your expertise in that broad sweeping statement. But the point is you've got a lot of really valuable information for our listeners. They might want to be following you after the episode. What's the best way to do that?
Matt Harrison: 01:08:54
Yeah, probably the best way is on Twitter. My handle is __mharrison__, or for Python people, that's dunder mharrison. I also have a mailing list at metasnake.com if they're interested in that.
Jon Krohn: 01:09:09
Nice. All right, Matt, thank you so much for coming on the show.
Matt Harrison: 01:09:14
My pleasure. Thanks for having me.
Jon Krohn: 01:09:16
Who knows? Next book topic, maybe we'll have to have you on again.
Matt Harrison: 01:09:19
Okay, well, I appreciate you letting me come back. So hopefully we can do it again sometime.
Jon Krohn: 01:09:25
Yeah, my pleasure, Matt, catch you again soon.
Matt Harrison: 01:09:27
Okay, bye.
Jon Krohn: 01:09:33
Another year, another exceptionally informative episode of this podcast with Matt Harrison. In today's episode, Matt filled us in on how XGBoost is an ensemble decision tree approach that affords extremely high classification accuracy and generalizes very well to new data. He talked about the hyperparameters, such as model depth, regularization, class weights, number of estimators, and learning rate, that we can fine-tune to maximize XGBoost's potency. He filled us in on the HyperOpt library, which leverages Bayesian statistics to perform hyperparameter search more efficiently, and likely more effectively, than a grid search. He told us that we should consider XGBoost as our first choice when we're working with a large quantity of tabular data, complete model interpretability is not essential, and we're not looking to minimize model execution time. And adjacent to XGBoost, Matt filled us in on the Pandas library for data pre-processing, scikit-learn for data pipelining, Yellowbrick for visualizing model performance, and xgbfir for model explainability, specifically gaining insight into interactions between features.
01:10:38
As always, you can get all the show notes including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs from Matt's social media profiles, as well as my own social media profiles at superdatascience.com/681. That's superdatascience.com/681. I encourage you to let me know your thoughts on this episode directly by tagging me in public posts or comments on LinkedIn, Twitter, or YouTube. Your feedback is invaluable for helping us shape future episodes of the show. All right, thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the SuperDataScience team for producing another super practical episode for us today.
01:11:20
For enabling that super team to create this free podcast for you, we are deeply grateful to our sponsors whom I've hand selected as partners because I expect their products to be genuinely of interest to you. Please consider supporting the show by checking out our sponsors' links, which you can find in the show notes. And if you yourself are interested in sponsoring an episode, you can get the details on how by making your way to jonkrohn.com/podcast. Finally, thanks of course to you for listening. It's because you listen that I'm here. Until next time my friend, keep on rocking it out there and I'm looking forward to enjoying another round of the SuperDataScience podcast with you very soon.