If you’ve ever slogged through preprocessing and feature engineering for categorical data in frustration and wondered whether there’s a quicker way to complete these tasks, you’ll want to listen to this episode. Host Jon Krohn explains how CatBoost makes these tasks a breeze by eliminating the need to preprocess or manually engineer features, thanks to its automated handling of categorical variables.
Unfortunately, CatBoost doesn’t have anything to do with cats (unless they’re in your dataset!); the algorithm’s name is simply a blend word, combining “category” with “boosting”. We like CatBoost because it can be installed in several of our favorite data science environments – Python, R and Apache Spark, among others. This decision-tree algorithm leverages gradient boosting to give you a range of applications, from more accurate classifications for your data to vastly enhanced ranking and recommendation systems.
Listen to how CatBoost manages all this under the hood on this week’s Five-Minute Friday. SDS host Jon Krohn explains exactly how the algorithm’s techniques – target encoding and one-hot encoding – help CatBoost handle categorical features in a snap. Jon also clarifies the three-step approach that all the major tree-boosting algorithms (including XGBoost and LightGBM) follow to ensure their final predictions are accurate.
ITEMS MENTIONED IN THIS PODCAST:
DID YOU ENJOY THE PODCAST?
- What key ingredients does an algorithm need before you use it in your work?
- Download The Transcript
Podcast Transcript
(00:05):
This is Five-Minute Friday on “CatBoost”.
(00:19):
Welcome back to the SuperDataScience Podcast. Today’s episode is dedicated to CatBoost, which is short for “category” and “boosting”. This is a powerful, open-source tree-boosting algorithm that has been garnering a lot of attention recently in the machine learning community.
(00:34):
CatBoost has been around since 2017 when it was released by Yandex, a tech giant based in Moscow. I’ve provided links to the original paper as well as to all of the official technical documentation in the show notes in case you’d like to dig into either of these.
(00:48):
In a nutshell, CatBoost — like the more established and regularly Kaggle-leaderboard-topping approaches XGBoost and LightGBM — is at its heart a decision-tree algorithm that leverages gradient boosting. If you’re unfamiliar with these gradient-boosted tree approaches, check out Episode #681 with XGBoost expert Matt Harrison to get up to speed.
(01:11):
Relatively briefly here, however: tree-boosting algorithms like CatBoost, XGBoost and LightGBM follow a three-step approach. In step 1, we initialize the model, typically starting with a simple decision tree, and the prediction made by this initial model is considered the baseline. In step 2, we iterate. In each subsequent iteration, a new decision tree is added to an ensemble, making this kind of like a random forest of decision trees. Training each additional tree involves adjusting the weights of training examples, with a focus on the misclassified or poorly predicted instances. This means that each new tree is built to minimize the errors of the previous ensemble. It’s this focus on error minimization that makes tree-boosting algorithms so powerful and efficient.
(02:10):
All right, so that’s step 2. Step 1, we initialize the model; step 2, we have iterative training. And then in step 3, we combine everything together into a big ensemble — like, again, our random forest. The predictions from all the trees in the ensemble are combined to form the final prediction. The combination mechanism varies depending on the algorithm and task, but typically involves averaging or weighted averaging. Now, the comparison I’ve made a couple of times with a random forest isn’t exactly correct; it’s just a loose way of saying that we’re combining decision trees together. The way we combine trees in random forests is different from the way we do it in these gradient-boosting approaches, but hopefully the comparison conveys the general concept of taking a bunch of decision trees and ensembling them together. Okay.
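To make those three steps concrete, here is a minimal toy sketch in Python using scikit-learn decision trees. The function names and the squared-error residual trick are illustrative assumptions for a bare-bones regression example — this is not how CatBoost, XGBoost or LightGBM is actually implemented internally:

```python
# A toy sketch of the three-step tree-boosting loop: initialize,
# iterate on the previous ensemble's errors, then combine.
# Illustrative only -- not CatBoost's actual internal algorithm.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_boost_fit(X, y, n_trees=50, learning_rate=0.1, max_depth=3):
    # Step 1: initialize with a simple baseline prediction (here, the mean).
    baseline = float(np.mean(y))
    prediction = np.full(len(y), baseline)
    trees = []
    for _ in range(n_trees):
        # Step 2: iterate -- fit each new tree to the current residuals,
        # i.e. focus on the examples the ensemble predicts poorly so far.
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return baseline, trees

def toy_boost_predict(baseline, trees, X, learning_rate=0.1):
    # Step 3: combine -- sum the baseline and the scaled contribution
    # of every tree in the ensemble to form the final prediction.
    prediction = np.full(X.shape[0], baseline)
    for tree in trees:
        prediction = prediction + learning_rate * tree.predict(X)
    return prediction
```

The learning rate simply scales each tree’s contribution so that no single tree dominates the ensemble.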
(02:59):
So boosting. Tree-boosting: that explains the “boost” part of CatBoost. Now let’s dig into the “cat” part, the “category” part of the CatBoost name. That comes from CatBoost’s superior handling of categorical features. If you’ve trained models with categorical data before, you’ve likely experienced the tedium of preprocessing and feature engineering with categorical data. CatBoost comes to the rescue here, efficiently dealing with categorical variables by implementing a novel algorithm that eliminates the need for extensive preprocessing or manual feature engineering. CatBoost handles categorical features automatically by employing techniques such as target encoding and one-hot encoding.
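As a hedged illustration of what “handling categorical features automatically” looks like in practice, here’s a minimal sketch of the CatBoost Python API. The DataFrame, its column names and the hyperparameter values are invented for the example; the key point is that categorical columns are passed as-is via cat_features, with no manual encoding step:

```python
# Minimal sketch: letting CatBoost handle raw categorical columns itself.
# The DataFrame and column names below are made up for illustration.
import pandas as pd
from catboost import CatBoostClassifier

df = pd.DataFrame({
    "city": ["Toronto", "Berlin", "Toronto", "Lagos"],      # categorical, left as strings
    "browser": ["chrome", "safari", "firefox", "chrome"],   # categorical, left as strings
    "visits": [3, 11, 1, 7],                                # ordinary numeric feature
    "converted": [0, 1, 0, 1],                              # binary target
})

X = df.drop(columns="converted")
y = df["converted"]

model = CatBoostClassifier(iterations=100, verbose=False)
# No one-hot or target encoding by hand: just tell CatBoost which
# columns are categorical and it encodes them internally.
model.fit(X, y, cat_features=["city", "browser"])
print(model.predict(X))
```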
(03:43):
Really quickly: one-hot encoding is where we represent all of the possible categories for a given variable as a vector of zeros, except that the single category present in a given row of our data is set to a “one” — hence “one-hot”. Target encoding, also known as mean encoding, simply involves replacing a categorical feature with the mean of the target variable for that category. These two techniques together, one-hot encoding and target encoding, are what CatBoost implements automatically in its novel approach to tree boosting. If you’re interested in learning more about one-hot encoding and target encoding, I’ve included links to those approaches in the show notes for you to check out.
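For a quick, hedged illustration of the two encodings themselves — using a made-up “color” column and “price” target — here’s how they might look if you did them by hand with pandas. Note that CatBoost’s own categorical encoding is more sophisticated than this naive version; this is just to show the idea:

```python
# Hand-rolled illustrations of one-hot and (naive) target encoding.
# The "color" and "price" columns are invented for the example.
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue"],
    "price": [10.0, 14.0, 12.0, 9.0, 15.0],   # target variable
})

# One-hot encoding: one 0/1 column per category, with a single "1" per row.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Naive target (mean) encoding: replace each category with the mean of
# the target for that category, computed over the whole dataset.
target_encoded = df.groupby("color")["price"].transform("mean")

print(one_hot)
print(target_encoded)
```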
(04:28):
But yeah, back to CatBoost: it takes advantage of those approaches. In addition to using one-hot encoding and target encoding for superior handling of categorical features, CatBoost also makes use of something called Ordered Boosting, a specialized gradient-based optimization scheme that takes advantage of an ordering of the training examples, allowing CatBoost to minimize its loss function during training more efficiently relative to other gradient-boosting approaches like XGBoost and LightGBM. In addition, CatBoost uses symmetric decision trees, which have a fixed tree depth, and this enables CatBoost to have a faster training time relative to XGBoost and a comparable training time to LightGBM — and LightGBM is famous for its speed, so that’s impressive. On top of all that, CatBoost also has built-in regularization techniques, such as the well-known L2 regularization approach as well as the ordered boosting and symmetric trees already discussed. Altogether — the L2 regularization, the ordered boosting, the symmetric trees — this makes CatBoost less likely to overfit to training data relative to other boosted-tree algorithms, which can be prone to overfitting.
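Those three ingredients map onto hyperparameters you can set explicitly in the CatBoost Python package. The values below are illustrative, not tuned recommendations, and you should double-check the official docs linked in the show notes — but as far as I know boosting_type, grow_policy and l2_leaf_reg are the relevant parameter names:

```python
# Sketch: surfacing ordered boosting, symmetric trees and L2 regularization
# as explicit hyperparameters. Values here are illustrative, not tuned.
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    boosting_type="Ordered",       # the ordered boosting scheme
    grow_policy="SymmetricTree",   # fixed-depth, symmetric (oblivious) trees
    depth=6,                       # the fixed tree depth
    l2_leaf_reg=3.0,               # L2 regularization on leaf values
    iterations=500,
    verbose=False,
)
```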
(05:45):
Altogether, this means that CatBoost may be the best-performing option for a broad range of tasks, including classification, regression, ranking, and recommendation systems. If you’re working with categorical variables, it’s an even better bet for you. And if you’re working with natural-language data, no problem: character strings can be vectorized into numbers.
(06:04):
All right, so hopefully you’re excited about using CatBoost now, if you haven’t heard about it before or if you haven’t dug into it much before. And if you are, remember that it’s open-source, completely free, and, in addition to that, installation is really easy. It can be installed in all of the most popular data science environments, such as Python, R, and Apache Spark — or you can even use it on the command line. It includes GPU acceleration, allowing you to train models faster and handle large datasets, including large-scale machine learning tasks that you need to run across multiple GPUs. And CatBoost supports interpretability: it natively includes SHAP values so that you can understand the contribution of each model feature and explain the model’s output.
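As a final hedged sketch, here’s roughly what GPU training and the built-in SHAP values look like in the Python package. The toy data is invented, and the parameter values are the standard ones as far as I know, but check the official documentation linked in the show notes for your setup (for example, multi-GPU device strings):

```python
# pip install catboost   # installs the Python package

# Sketch: GPU-accelerated training plus built-in SHAP values.
# The toy DataFrame is invented for illustration.
import pandas as pd
from catboost import CatBoostClassifier, Pool

df = pd.DataFrame({
    "city": ["Toronto", "Berlin", "Toronto", "Lagos"],
    "visits": [3, 11, 1, 7],
    "converted": [0, 1, 0, 1],
})
train_pool = Pool(df.drop(columns="converted"), df["converted"], cat_features=["city"])

model = CatBoostClassifier(
    iterations=200,
    task_type="GPU",   # train on the GPU; devices="0:1" would span two GPUs
    devices="0",
    verbose=False,
)
model.fit(train_pool)

# Built-in SHAP values: one contribution per feature for every row,
# plus a final column holding the expected (base) value.
shap_values = model.get_feature_importance(train_pool, type="ShapValues")
print(shap_values.shape)   # (n_rows, n_features + 1)
```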
(06:52):
All right, I hope you’re excited to jump on the bandwagon and try out CatBoost for modeling tabular data if you haven’t already. For raw media inputs such as images, video and audio, or for generative models — anywhere you’re generating something like natural language or an image — you’re probably still going to want to use deep learning. But if you’re working with tabular data, like you’d find in a spreadsheet, a boosted-tree approach like CatBoost is likely the way to go.
(07:26):
All right, that’s it for this week. Thanks again to Shaan Khosla, on the data science team at my machine learning company Nebula, for providing the topic idea and some of the content of today’s episode through a recent edition of his excellent Let’s Talk Text newsletter.
(07:42):
And thanks again for tuning in. Until next time, my friend, keep on rockin’ it out there and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon.