82 minutes
SDS 649: Introduction to Machine Learning
Subscribe on Website, Apple Podcasts, Spotify, Stitcher Radio or TuneIn
After a couple of years away, Kirill Eremenko and Hadelin de Ponteves are back in the teacher's chair. The best part? They're here with a killer new course, Machine Learning in Python: Level 1, and they join us this week to give us a primer on machine learning. Tune in to sharpen your knowledge of ML concepts like classification errors, feature scaling, the elbow method and more.
About Kirill Eremenko
Kirill is the founder and former host of the SuperDataScience Podcast. He is the founder and CEO of SuperDataScience, the platform hosting this podcast, which also features a wide range of data science and analytics courses, from tool-based courses such as R Programming, Python, and Tableau to overarching courses like Machine Learning A-Z and Intro to Data Science. He is also a well-known instructor on Udemy, where his courses have been taken by over 1.7M students worldwide. Kirill is also the founder of the DataScienceGO Conference, created for data-powered minds, where experts, mentors, and friends come to enlighten, click and inspire each other and skyrocket their careers. Kirill is absolutely and utterly passionate about data science and delivering high-quality, accessible education to every human on this planet!
About Hadelin de Ponteves
Hadelin is the co-founder and CEO at BlueLife AI, which leverages the power of cutting edge Artificial Intelligence to empower businesses to make massive profits by innovating, automating processes and maximizing efficiency. He is passionate about helping businesses harness the power of AI.
He is also an online entrepreneur who has created over 70 top-rated educational e-courses on topics such as Machine Learning, Deep Learning, Artificial Intelligence and Blockchain, which have already made 2M+ sales in 210 countries.
Hadelin is an ex-Google Artificial Intelligence expert and holds a Master's degree in Engineering from École Centrale Paris with a specialization in Machine Learning.
Overview
In the fast-paced world of machine learning, it's often difficult to keep up with all the latest developments. But every field (no matter how complex) has a foundation of key concepts, and ML is no exception. That's where AI experts like Kirill and Hadelin come in handy. With years of experience teaching thousands of students around the world, they've carved out simple explanations for complex problems, and this week, they join Jon Krohn to introduce (or re-introduce) you to the fundamentals of ML.
Whether you're trying to understand your customers better, optimize your production cycle, identify fraud or refine your marketing campaigns, ML can help you do it, and learning simple concepts is a great place to start. To kick things off, Hadelin tackles the basic topic of supervised vs. unsupervised learning. As he explains, supervised learning makes use of labeled datasets to train algorithms and predict outcomes. Unsupervised learning, on the other hand, leverages algorithms to uncover patterns in unlabeled datasets without human intervention.
In situations where input values vary widely in scale or range, you may wonder how this impacts results. As Hadelin explains, it depends on the model being used. This is where the concept of feature scaling comes in handy. It's a method that normalizes (between 0 and 1) or standardizes the range of variables and, in turn, improves the performance of your model. It's a vital step in the pre-processing of data for machine learning. Dive deeper into feature scaling and many more concepts, including linear regression and the Elbow Method, by tuning into the full episode.
Inside, you'll also receive an update on what Kirill and Hadelin have been up to these past few years, and you'll preview their new course. After a long break from teaching, Kirill and Hadelin reflected on the state of AI, read through thousands of reviews, and pored over student questions to help them refine their new teaching style. Discover the new and improved format in the podcast episode today.
In this episode you will learn:
- Kirill and Hadelin's new course [17:34]
- Supervised vs unsupervised learning [26:23]
- False positives and false negatives [31:21]
- Logistic regression [43:00]
- Holding out a set of test data [46:39]
- Feature scaling [52:45]
- The Adjusted R-Squared metric [59:44]
- The five assumptions of linear regression [1:05:12]
- The Elbow Method [1:11:41]
Items mentioned in this podcast:
- Machine Learning in Python: Level 1
- Assumptions of Linear Regression cheatsheet
- SDS 499: Data Meshes and Data Reliability
- ChatGPT
- Will Humans Go the Way of Horses? by Erik Brynjolfsson and Andrew McAfee
- AI Crash Course by Hadelin de Ponteves
- Confident Data Skills by Kirill Eremenko
- Jon’s Podcast Page
Podcast Transcript
Jon: 00:00:00
This is episode number 649 with the renowned data science educators, Kirill Eremenko and Hadelin de Ponteves.
00:00:11
Welcome to the SuperDataScience Podcast, the most listened to podcast in the data science industry. Each week we bring you inspiring people and ideas to help you build a successful career in data science. I'm your host, Jon Krohn. Thanks for joining me today. And now let's make the complex simple.
00:00:42
Welcome back to the SuperDataScience Podcast. Today is a special episode featuring not just one acclaimed guest, but two acclaimed guests at once. Kirill Eremenko, if you haven't already heard of him, is one of our two guests. He's the founder and CEO of SuperDataScience, an e-learning platform, and he founded this SuperDataScience Podcast in 2016 and hosted the show until he passed me the reins two years ago.
00:01:09
Our second guest is Hadelin de Ponteves. He was a data engineer at Google in Paris before becoming a content creator. In 2020, he took a break from data science content to produce and star in a Bollywood film featuring Miss Universe, Harnaaz Sandhu. Together, Kirill and Hadelin have created dozens of data science courses and they are the most popular data science instructors on the Udemy platform with over two million students. After a multi-year hiatus from creating courses, they recently published a new course called Machine Learning in Python Level One. And that is what we're focusing on in today's episode.
00:01:48
So today's episode will appeal to anyone who's familiar with Kirill and Hadelin, and who'd like to hear about what they've been up to over the past couple of years, why they stopped creating courses, and why they're back at it now. After we get through that, this episode will also serve as an introduction to machine learning. So it'll primarily appeal to folks who aren't already experts in ML.
00:02:08
That said, I've been doing machine learning for over 15 years and I still learned a few critical new pieces of information while filming this episode. So this could serve as a fun, lighthearted refresher even for experts. In this episode, Kirill and Hadelin introduce machine learning concepts such as supervised versus unsupervised learning, classification errors, logistic regression, feature scaling, the adjusted R-squared metric, the assumptions of linear regression and the elbow method. All right, you ready for this highly educational episode? Let's go.
00:02:47
Wow. On the SuperDataScience Podcast with Kirill and Hadelin. This is amazing. Welcome back to the show guys. Where in the world are you calling in from?
Hadelin: 00:02:58
Thank you. Thank you. I'm calling from Dubai.
Kirill: 00:03:03
And I'm calling from Gold Coast, Australia.
Jon: 00:03:06
Yeah. So for those of our listeners who are conscious of those planetary locations, I'm in New York and between New York, Dubai, and Australia, guess what? Some people are really suffering to make this episode for you. And I'm the only one who's not. So Hadelin is up late.
Hadelin: 00:03:29
Yes.
Jon: 00:03:30
In Dubai. And, Kirill, it's what, 4:00 in the morning or something crazy for you?
Kirill: 00:03:35
5:30 now.
Jon: 00:03:36
5:30 in the morning now. Yeah, but you had to probably get up at 4:30 to make this happen. So appreciate it, man.
Kirill: 00:03:42
I saw the sunrise. It's really good. For the first time.
Hadelin: 00:03:44
Yeah, I'm seeing the moon.
Jon: 00:03:52
So we've got a really special episode today. I'm really excited about this. We're going to talk about what you guys have been up to lately. And this is going to be a really interesting technical deep dive episode in a way that-
Kirill: 00:04:05
For beginners.
Jon: 00:04:06
Yeah, for beginners. We seldom go this deep on technical stuff for listeners. It's really exciting. I like this format. I'm glad that you guys are prepared for this. So you both launched the extremely popular SuperDataScience hands-on catalog of courses, and this was seven years ago. I feel confident, though I haven't rigorously checked this, that you guys are the top two data science instructors on Udemy all time. I feel like that's safe to say. You've reached over two million students worldwide, and now you are revamping your offering with new teaching styles and much more. What's changed in the last seven years, and why?
Kirill: 00:04:45
Well, thanks Jon. Thanks first of all for having us. Super excited to be back on the show. It's always a pleasure. I was telling Hadelin that I'm a little bit nervous coming on.
Jon: 00:04:55
Oh, really?
Kirill: 00:04:56
But excited. It's that nervousness and excitement, which is always good. And I guess what really changed was, in the seven years that we've been teaching, the world of online education, first and foremost, has grown a lot and developed. Of course, data science has changed, but also the online education world has gone forward, and students' expectations have changed. For instance, mostly relating to the quality of content, of course. We've always strived for very high-quality content. So we were always confident about that.
00:05:36
But another thing that has changed is time. Everybody has less and less time. It's a stark difference to what it was seven years ago when people were excited and prepared to watch long tutorials and that was the norm. Now people want, and fair enough, we live in a very fast world, people want insights, and information, and value fast. So one of the things that we really worked on this time around was shortening our tutorials, really getting to the point very quickly and making a point. And it was very interesting to see, because we do have a lot of students, we have lots of reviews, we talk to our students when we need some feedback. And that was one of the big things. That's what I would say. What do you think, Hadelin?
Hadelin: 00:06:29
Yeah, I totally agree. First, what we have been doing for the past seven years is to listen to all students' feedback. We have the reviews in Udemy and I'm sure same as you, Kirill, I often read the reviews and try to understand what people like and don't like and how we can improve our courses. So yes, totally agree with you. I noticed that sometimes our tutorials were a bit too long. So you're definitely right. People have less time, but also have more distractions. And therefore, that reduces also the attention span. And that's why it's much better to fit video lectures into the attention span so that at least people can get the maximum knowledge.
Jon: 00:07:15
So what you're telling me is you converted all of your courses into a TikTok format?
Hadelin: 00:07:20
Yeah, it will go into that direction. Yes.
Kirill: 00:07:23
No, not all, of course.
Jon: 00:07:25
And another interesting piece of context here for listeners who aren't aware, so you guys have been on, and probably many people know that Kirill was host of the SuperDataScience Podcast for its first four years. I've now been hosting for a little over two years. But, Kirill, you were-
Hadelin: 00:07:40
Congrats.
Kirill: 00:07:43
Yay. We love Jon. Congrats, Jon.
Jon: 00:07:45
And so, Kirill, you were on the show in 2021 after I'd been hosting for about six months, and Hadelin was shortly thereafter as well. And at that time, neither of you was creating courses for Udemy. You weren't actively creating content after having been doing it for five years and having some of the most popular courses on the platform, certainly in data science. So what changed that now you've both decided to say, "Now's the right time, we've got to get back in and be creating new content."?
Kirill: 00:08:17
Hadelin, remember how we were talking about it at the end of those five years? We were like, "We're never creating courses again."
Hadelin: 00:08:26
Yeah, yeah, yeah.
Kirill: 00:08:28
Which I think we just burned out. We were working so much on all these things. But what do you reckon?
Hadelin: 00:08:37
Yeah, for me, a lot of things happened in the past two years, and I've realized a lot of things. So first, I took those two years break to do a bit of cinema, and that was great. But doing only cinema is tough, because you always have to wait. There's not much going on. Especially as opposed to online education where we used to make one course per month. And so that was so dynamic.
00:09:03
And so I missed it at some point. I wanted to get back into it. And plus, I realized, and this is the big realization I've had recently, is that in life you really can grow, you can really leverage a high level of growth with the compound effect. And so since for seven years we had been doing those courses, well, I could get back on a high level of growth by starting those courses again with Kirill. Because even though I took a two-year break, well, it's like riding a bicycle, getting back to it was easy and I could pick up from where I left things off. So yes, now I've decided to continue with online education, creating courses and at the same time continuing cinema as a hobby.
Jon: 00:09:54
Nice. And I actually, I remember in your episode, Hadelin, it was episode number 505. And in that episode, I specifically remember you saying, I didn't do research to remember this, I just remember that you said something like you will get back into it when you feel like there's something inspiring you to do it. That you felt like you had covered all of the things that you needed to cover at that time.
Hadelin: 00:10:24
That's absolutely true. Yes, that's absolutely true. At that time when we finished, when we started taking this break, indeed, it was saturated. We had taught everything we could have possibly taught with our knowledge. We had taught up to state-of-the-art AI, and we even expanded to other topics like blockchain. But at some point it's true, it's like, yes, there was not much else we could teach. But now so much has happened the past two years. And so there are many new things we can teach. And this is a really great feeling, because now I'm with, Kirill and I are teaching some new stuff, we're working on some new stuff. And at the same time we are learning, because a lot of things have happened. So it's very exciting.
Jon: 00:11:10
Nice. All right. And we are going to get into the content of your new course very soon. But I just have a curiosity question for you, which is, have you used data science to understand your students' needs? So you were talking about reading reviews to help you figure out where you should be shifting your content. Is there any kind of data science that you can actually apply to, I don't know, views, or I don't know, competitor information, anything like that?
Kirill: 00:11:43
Oh, thank God we have, otherwise we'd be so hypocritical sitting here saying we teach data science and we don't use it.
Jon: 00:11:50
I know. I felt a little nervous asking the question, because I was like...
Kirill: 00:11:53
No, no, we have, of course, we've used everything from surveys, the standard easy quantitative data collection, to the point of natural language processing of user reviews, because there's so many of them, hundreds of thousands of reviews over the seven years. So we would collect them to analyze them with NLP, or reviewing feedback, qualitative feedback that we get also through surveys, but for open-ended responses.
00:12:28
And to the point of, we have a chatbot that is, I think at some point we were getting, I'm not sure what the number is now, but we were getting 10,000 questions. I might be quoting this incorrectly, but somewhere along the lines of 10,000 questions per week. And we even got an award from Udemy for the most questions answered in some period, or year, or something. Yeah. This was in 2018.
00:12:53
And yeah, so we have a chatbot that handles the first tier of questions on Udemy. So if you ever ask a question, the first response will come from a chatbot, and then only if the chatbot's not able to help, then one of the teaching assistants will jump in, of whom we also have a few working. And so yeah, so we have this whole database of typical questions and answers, and not only have we gone through them, but the chatbot has been created to be able to answer them. Not as good as ChatGPT, but yeah, still. [inaudible 00:13:34].
Hadelin: 00:13:34
It would be great if we could integrate ChatGPT in our courses to answer the questions.
Kirill: 00:13:39
Yeah. And ChatGPT 4 is coming out soon, right? What, in a few weeks?
Jon: 00:13:43
So yeah, GPT-4. It may be out, actually, by the time this episode is released or shortly thereafter. But I think the versioning is, we use a different... It's ChatGPT version one that's the new one.
Hadelin: 00:13:57
GPT-3.
Jon: 00:13:57
Is the one that's just come out. So I don't think we would call it ChatGPT 4, even though GPT-4-
Kirill: 00:14:02
Okay.
Jon: 00:14:05
I don't know.
Hadelin: 00:14:07
Yeah.
Jon: 00:14:08
But yeah, so amazing. I actually didn't know about that chat bot. And so yeah, you've gone... It looks like you've identified all the data science opportunities possible with your courses [inaudible 00:14:19].
Kirill: 00:14:19
Oh, I think there's always more. There's always more you can do.
Jon: 00:14:23
I guess, yeah, maybe by the time there's ChatGPT 4 in a few years, that will just create your course for you. Just, "ChatGPT 4, write me the script for my machine learning course."
Kirill: 00:14:36
Jon, but no joking, you know how, in your season's greetings for Christmas, you used ChatGPT to write it. And that was really fun to listen to. ChatGPT has already been out for, what, three, four weeks, a month now, or maybe two months. Now, for creating new content, I use it for research. Rather than Googling things and finding out, oh, what's some technical question, and then going through 50 different links, I just ask ChatGPT and it gives me answers. But yesterday I found that it can lie to you. Just sometimes it says incorrect things. So you got to be careful. But it definitely helps with research.
00:15:20
One of my friends in Sydney is a lawyer, and he was visiting me here recently, and I just told him like, "Hey, there's this thing, what would you like to ask it?" And he's like, "Give me case law." Because for lawyers, case law, especially in the Western legal system, case law is very important, and it's really hard to find, and you have to dig through thousands of cases to find the one relevant to you. He was like, "Give me case law relating to blah, blah, this topic." And then ChatGPT just spits out all of these things.
00:15:45
So I think to your point, it will definitely transform the way we do lots of different jobs, like how DALL·E 2 and Midjourney, and those things are already transforming art, artwork. You can just create it in five seconds. I think ChatGPT will also transform lots of things. And the applications are incredible. In fact, Hadelin has been working on a video or a series of videos on how to apply ChatGPT for data science. And he's already got 10 use cases that will blow your mind on how you-
Hadelin: 00:16:23
Yes. Yes. No, crazy thing, we planned to actually only do five data science use cases. But because it was so cool and so amazing, I ended up doing 10 without realizing.
Jon: 00:16:35
You just asked for five more from ChatGPT.
Hadelin: 00:16:40
Yeah. Yeah, exactly.
Jon: 00:16:41
Yeah, we actually did, a couple of episodes ago, an episode came out, number 646, with a layperson, a brewer who uses ChatGPT for lots of different purposes. And so we provided lots of links in the show notes to the kinds of resources that listeners could access for coming up with ideas of using it themselves. But for example, this guy, this brewer, he's using ChatGPT for creating marketing copy, and for generating blog posts. And so it's cool to me that these kinds of tools like DALL·E 2, or Midjourney that you mentioned, Kirill, and now ChatGPT, you can be taking advantage of the state-of-the-art in AI without ever needing to write a line of code or having any experience in data science, which is really cool.
Kirill: 00:17:27
I know, yeah. Yeah.
Jon: 00:17:30
Yeah. Looking forward to seeing those cases as well. For those of our listeners who still want to learn the fundamentals of machine learning and be able to understand how they could be creating applications like this themselves as opposed to just using them, you guys have your new course, it's called Machine Learning Python Level 1. And it provides a solid foundation in machine learning with, of course, Python examples. And it's a super engaging course. Can you elaborate for us on the contents in this course? We can start with just a general overview and then I know we're going to dig into a lot of the technical content in detail.
Kirill: 00:18:10
Yeah, yeah, sounds good. Before we dive into that, I feel like if I were a listener, I'd be sitting there and thinking, "Where are those 10 use cases for data science, ChatGPT in data science?" right? But I just want to say, that's not the topic of this episode and we're still working on them. So maybe once they're released and available, we'll hit Jon up and maybe be able to share them somehow through the podcast without necessarily appearing on an episode.
Jon: 00:18:37
Yeah. Yeah. I mean, it could be ideal maybe for a Five-Minute Friday.
Kirill: 00:18:40
Yeah, yeah, something like that. We won't leave you hanging, don't worry. [inaudible 00:18:45] just announced it or mentioned it and didn't really go back to it. But yeah, thanks for the question. Hadelin, do you want to start maybe on this one? I think this course was actually Hadelin's original idea, right?
Hadelin: 00:19:00
Mm-hmm. So first, it's a course for beginners. We are starting from the very beginning, and we're trying to lay the foundations of machine learning. So it's really a course on the foundations of machine learning, covering three branches, the three major ones, which are regression, classification, and clustering. And for each of these branches, we explain, well, Kirill explains the theory. And again, we have changed our style. We have brand-new slides that apply all the feedback we have listened to over the past few years.
00:19:36
And you also have a practical activity. And each of the branches covers one model. For regression, it's linear regression. For classification, it's logistic regression. And for clustering it is K-means clustering. And so yes, for each of those branches, we do the theory and a practical activity on a real-world case study. And we cover technical topics like the R-squared, the adjusted R-squared, the accuracy, the confusion matrix, the elbow method, and other topics.
Kirill: 00:20:09
And I would like to add that for those of the listeners who haven't taken any of our courses before, the way we teach is different to what you would expect from, I guess, a data science course. For example, Jon, your courses are very math heavy. You even have a course on the mathematical foundations of data science or machine learning, right? And if you pick up your book, it's also math heavy, as I understand. Not in a bad way. You explain all the maths and that's important, right? You've reiterated on the podcast, it's important to know all these things to really understand data science on a very deep level.
00:20:56
But with what we teach, we focus more on the... The approach, if you think of a car, I call it the car analogy. If you want to drive a car, you can learn all the things inside the engine, and understand what a crankshaft is, and what a camshaft is, and what the difference is, and how to fix things. Or you can just get in the car, understand the basics of where the gas is, where the brakes are, how to use the steering wheel. So get explained all the basic principles of how this thing works, and then get a lot of practice driving the car. And that's how most people, 95% of the population, learn to drive a car, right? You don't even need to know how to change the oil, or filters, or anything like that. Just where to put the petrol in.
00:21:44
And so that's the approach we take with data science, especially in beginner-focused courses. We remove all the heavy mathematics. We give a little bit of mathematics, the one that is very relevant. But ultimately, it's like, "Here's the gas, here's the brake, here's the steering wheel, this is where you put the petrol in." That's the part that I explained. And then Hadelin explains the driving back and forth to different locations. Hadelin gives all the practical aspects. And so in that way, while you're not becoming a deep expert in the field, you are getting a quicker start. You're getting a faster start. So this course is what, three hours. By the way, as I mentioned at the beginning, we're not trying to pitch the course on the podcast. We will give as much as we can in terms of the technical aspects in a few minutes.
Jon: 00:22:32
Yeah. Yeah. You mentioned that to me before we started recording. So the audience is hearing just that for the first time that you're not-
Kirill: 00:22:39
Yeah, yeah, yeah. I'm just using this course as an example. Even from listening to this podcast, you'll probably already learn quite a few technical aspects that you can apply without knowing the deep mathematics, but you can already apply. We can change the world. By we, I mean collectively, including the listeners, all of us, we can change the world by knowing how to correctly apply these very powerful tools. And then for those of us who are interested, of course, when we want to go deeper, yes, of course, learn all the mathematics and maybe improve the tools further. But the starting point, I think, should be at least understanding, like driving a car, how to apply these things and getting that practical experience so you're doing it correctly. So that's what we focus on, how we focus on building our courses.
Jon: 00:23:24
Nice. So let's dig into some of this content. Let's give listeners a great education here. Like you say, you're not here to sell, you're here to educate. So we are going to get big into these topics. So maybe before we have that conversation, who is your target audience for your Machine Learning Python Level 1 course? Which is probably similar to who your target audience is for the rest of this podcast episode.
Kirill: 00:23:50
That's a good point. I'll say from my side, and maybe, Hadelin, you can add to that. So I think I would say the target audience is anyone who wants to get started in machine learning, data science, who has no idea how these things work at all. That's number one. Number two, anyone who doesn't want to build a career in data science, but wants to add data science to their existing skill set. So you might be, for instance, in marketing, or you might be in operations, you might be in virtually any function of a business. You might be an entrepreneur. You want to get an edge and be able to apply some of these tools to understand maybe your customers better, or why you have a backlog in your business, or which company to invest in, or where to launch your next shop, or something like that. So analyze some data that you have and you don't know how to do it.
00:24:54
And so that's, I would say, the two. And I think the third would be anybody who's already an intermediate data scientist and would specifically like to refresh on some of the things that are in the foundations, because once we get more experience and have more exposure to this field of work, it's easy to forget what was in the beginning and what are the foundations of this field. So if anybody wants to refresh their knowledge, this would also be relevant.
Jon: 00:25:30
Nice. Sounds great. And then I guess a big distinction between what we're about to do in the podcast right now versus your course is that when people do the course, it'll be hands-on in Python, whereas this podcast will not be hands-on in Python. So this will be, I think, the market for this particular podcast is even broader than your course because there might be lots of people out there who are like, "I don't know if I want to be learning Python right now." You're like, "I'm an investment manager, or I'm a busy executive." Or whatever, and you're like, "I don't think writing hands-on Python is in my future, at least not for now. But I would like to have an introduction to machine learning, understand all the concepts, all the key concepts."
Kirill: 00:26:11
Yeah, that's a very good point, Jon. Thanks. And yeah, hopefully we can give a lot of value to listeners in this podcast already.
Jon: 00:26:18
Nice. No doubt you will, given all of your experience. So you guys have come prepared with some specific topics that you want to cover. I know that the first one is explaining what supervised learning is in machine learning relative to unsupervised learning.
Hadelin: 00:26:33
Yes. And I'll take this one. So the main difference between supervised learning and unsupervised learning is that in supervised learning, you know what to predict, you have a dependent variable for which you already have the ground truth, so you know its labels in past data. And so you can use these labels in the dependent variable to train your model.
00:26:55
While in unsupervised learning, you don't know what to predict, and therefore, you don't have a dependent variable. You only have some inputs; for any machine learning model, you will have some inputs. But in supervised learning, you will also have an output with some labels. And in unsupervised learning, you won't have an output yet, because in unsupervised learning, what the model will do is identify either some clusters or some patterns. And usually the way it identifies them is by creating a dependent variable, creating an output, which you can afterwards transform into a supervised learning model.
00:27:31
So we give this big case study in one of our courses where you start with unsupervised learning. You're trying to identify fraud in credit card transactions. So you use the unsupervised learning to figure out the fraud, and the model will actually find and build a dependent variable. After which, well, it will be able to predict the result of that dependent variable for future credit card transactions.
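For readers who'd like to see that workflow in code, here is a minimal sketch of the unsupervised-then-supervised idea. The course's actual model and dataset aren't specified in the episode, so IsolationForest and the synthetic transaction data below are purely illustrative stand-ins.

```python
# Hypothetical sketch: use an unsupervised model to build labels, then train a supervised model on them.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))            # stand-in for credit card transaction features

# Unsupervised step: no labels yet, the model simply flags unusual transactions
iso = IsolationForest(contamination=0.02, random_state=42)
flags = iso.fit_predict(X)                # -1 = anomaly, 1 = normal
y = (flags == -1).astype(int)             # the "dependent variable" the model built: 1 = suspected fraud

# Supervised step: with labels in hand, train a classifier to predict them for future transactions
clf = LogisticRegression().fit(X, y)
print(clf.predict(rng.normal(size=(3, 5))))
```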
Jon: 00:27:59
Cool. That sounds like a great example. Yeah. So with supervised learning, we have labels that we can use, as you described, the dependent variable in the model, the output of the model. So in both cases, supervised or unsupervised, you have some kind of input to the model. But with supervised learning, you also have this label that can be predicted. And yeah, you gave a really cool example there, including with credit card fraud where you could use unsupervised learning to predict what those labels should be, and then use that in a supervised learning model. But there also might be circumstances, I guess, where we use unsupervised learning without necessarily the intention of being able to do supervised downstream where we just want to understand our data better, identify some patterns, things like that, right?
Hadelin: 00:28:49
Correct. Yes. Yes. It's usually patterns or clusters. We also have this other example where we identify clusters of customers, and we have as the inputs the age, the estimated salary, and their spending score, which is a score from one to 100, where the closer to 100, the more they spend in the mall. And at the end, what the clustering, therefore unsupervised learning, model identifies is different clusters of customers, where some will spend more in the mall while having a low income, or some don't spend too much in the mall while having a high income, or you have different categories. And then once you identify those clusters, you can apply different targeted advertising to them. And yes, this can lead to terrific results.
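As a rough illustration of that kind of customer segmentation, here is a small sketch using scikit-learn and made-up data (not the course's actual mall dataset):

```python
# Illustrative only: synthetic customers with age, estimated salary, and a 1-100 spending score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
customers = np.column_stack([
    rng.integers(18, 70, 200),             # age
    rng.integers(20_000, 140_000, 200),    # estimated salary
    rng.integers(1, 101, 200),             # spending score (1-100)
]).astype(float)

X = StandardScaler().fit_transform(customers)                    # features sit on very different scales
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)  # no labels needed: unsupervised
print(kmeans.labels_[:10])                                       # cluster assignment for the first ten customers
```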
Jon: 00:29:49
Nice. Super cool. All right, so that's supervised versus unsupervised learning. And I imagine that must be one of the first topics that you cover in your Machine Learning Python Level One.
Kirill: 00:29:59
Oh, actually one of the last ones. Because once we get through clustering, we first talk about regression, then classification, then we get to clustering, that's when we cover...
Jon: 00:30:09
Nice. So when people are using regression and classification, they're doing supervised learning-
Kirill: 00:30:13
They don't know.
Jon: 00:30:13
But yeah, they don't know. And I guess it doesn't... That's a testament to how you guys do your course where you're giving practical applications and you are teaching the theory as it's needed. People don't need to know that there are these different categories, supervised, unsupervised, until you get to a point where you're like, "Oh, now we're getting into clustering, which is completely different, because now all of a sudden we have data without labels." Cool.
Hadelin: 00:30:44
Yes. Yes.
Jon: 00:30:45
All right. And so when you're training your supervised learning models, there's a really important concept around whether the model is getting the labels correct. So with your supervised learning models, you could take some small percentage of the data, say 20% or 10% of it, and keep it off to the side, so that after you train your model, you can later evaluate it on the data that were set aside, data the model didn't get to train with, and confirm that the model still works well. But there are situations where it misclassifies, it makes mistakes. And so these are called false positives, false negatives. Kirill, do you want to-
Kirill: 00:31:31
Yes, that's our second topic we wanted to cover today. False positives, false negatives. So if you imagine, let's talk about classification model. I don't know, for instance, a logistic regression. And we'll look at the example to make things more... Give more visual or impactful. Let's look at the example of predicting tumors, for example, like cancer, lung cancer, or something like that, which is a very important topic, because it affects people's health. So you got to get these things as right, as correct as you can possibly get them.
00:32:10
So let's say a computer vision model or some kind of machine learning model looks at some images of lungs. And after it's been trained, it's learned how to predict lung cancer, but, like a human, it can still make mistakes. So by looking at a given image of somebody's lungs, it can say one of two things, and one of four things can happen. It can either say, "Yes, there is cancer." Or, "No, there isn't cancer."
00:32:41
Now in the case when it says, "Yes, there is cancer," and indeed, the person, in reality, let's look at the facts of the world, what is the reality of the universe we live in? Indeed, that person has cancer. That's called a true positive. That means the model said, "Yes, it's positive." And it's true that it's positive in the real world. Now, in the case when the model says, "No, the person doesn't have cancer," that's a negative. The model gives a negative. And in the real world, it can also be the case, hopefully, that the person doesn't have cancer. So that's a true negative. So the model said, "No cancer, negative." And it's true that it's negative.
00:33:23
So those are the two outcomes, the only two outcomes that we want to have. Ideally, we want all of them to be true negatives, so nobody has cancer. But the reality is that cancer does happen. So at least we want to know when it is correct. So those outcomes are acceptable for the model. Those are the outcomes that we look for. True positives and true negatives.
00:33:42
Now, there's two other outcomes that are dangerous and that we want to minimize, but they still happen. And those are the errors. So for example, when the model says that the person has cancer, so it gives a positive result in its modeling, predicts that they do have cancer from the image. But in reality, in the real world, they don't have cancer. It's a mistake. The model made a mistake saying that they have cancer. So that is called a false positive. So the model is giving a positive result, the person has cancer. But in the real world, in the real state of things, it's false. And that's called a type one error. As you can imagine, it would be very devastating for a person to hear that they have cancer when they actually don't. And that's something that needs to be checked, or double-checked, triple-checked, or whatever. At least we want to minimize these kinds of errors.
00:34:36
And then there's a type two error, which is when a model says the person doesn't have cancer, so it gives a negative result. And in the real world, in the real state of things, the person does have cancer. And that's called a false negative. And that's a type two error. And arguably, that's even more dangerous, because in this case, the person has cancer, but they've been told that they don't have cancer. And so that cancer can progress, can get worse. And nobody's treating it, nobody's looking at it. And so a false negative is another type of error we really want to avoid, want to minimize, and it's a false negative.
00:35:15
And so you can evaluate how well a model is working, how predictive and how accurate a model is, by taking all these four things and putting them together. It'll be hard to explain on a podcast what a confusion matrix looks like. But if you take all these four things and put them into a matrix, just two by two, two rows by two columns, that's called the confusion matrix. And if you add up the true positives and the true negatives, so the correct predictions, let's say you have 100 cases total, you have 80 true positives and 10 true negatives. So that's 90 that the model predicted correctly. And it has seven false positives and three false negatives. So if you add up the true positives and true negatives, and you divide by the total number of cases, so 90 divided by 100, you get the accuracy ratio, which is, in this case, 90%. So you want the ratio to be as high as possible. If your model has only got an accuracy ratio of 50%, what's the point of that model? It's like flipping a coin, doing a coin toss.
00:36:22
So yeah, so that's basically false positives and false negatives. And I guess what we wanted people to take away from this podcast is understanding the difference between them. So if you keep this cancer example in mind, it'll help you get back to what the difference is. Because a false positive is a type one error and it's bad, but at least the person is safe. They don't have cancer, they'll probably just go through a lot of stress and maybe an unnecessary start of treatment. But then eventually they'll probably find out they don't have cancer, hopefully. A false negative, in my view, is arguably more dangerous, because things can progress and get worse. And that's how I remember that it's a type two error.
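Kirill's worked example maps directly onto a confusion matrix; the sketch below rebuilds his made-up counts (80 true positives, 10 true negatives, 7 false positives, 3 false negatives) with scikit-learn.

```python
# Recreate the hypothetical 100-case example and recover the confusion matrix and accuracy.
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# 1 = "has cancer", 0 = "no cancer"
y_true = np.array([1] * 80 + [0] * 10 + [0] * 7 + [1] * 3)   # reality
y_pred = np.array([1] * 80 + [0] * 10 + [1] * 7 + [0] * 3)   # the model's predictions

# Rows are reality, columns are predictions: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))   # [[10  7] [ 3 80]]
print(accuracy_score(y_true, y_pred))     # 0.9, i.e. (80 + 10) / 100
```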
Jon: 00:37:05
Yeah, yeah. I think, and certainly in that circumstance, the type two error is the worst one. There could be some other kinds of models and-
Kirill: 00:37:11
Yes, of course.
Jon: 00:37:12
... Where, yeah, it becomes a bigger issue to have the false positive. So for example, maybe a machine learning algorithm that decides somebody should go to jail.
Kirill: 00:37:23
Oh, yeah. Yeah.
Jon: 00:37:25
Then maybe the false positives are even worse than the false negatives. You've already talked a bit about how in your course you have lots of examples, and you just now gave a great example with the cancer case study. Hadelin, do you have some more examples of applications of the three categories that you guys cover in your course?
Hadelin: 00:37:46
Yes, absolutely. Well, first, the cancer prediction example that Kirill just gave is one of the practical activities of the course. It is actually in the classification part because, as a reminder, the course covers the three foundational, essential branches of machine learning, which are regression, classification, and clustering.
00:38:09
Regression and classification are a part of supervised learning, because you know what to predict. In regression, what you have to predict is a continuous numerical value, a real value. And in classification, what you have to predict is a category. And in the classification part, what we do is actually breast cancer tumor prediction. So yes, we have this data set of patients from different hospitals for which we have gathered several features, like features of their breast and other medical features, which are numerical inputs. And as outputs, we have recorded whether they have had a cancer tumor.
00:38:52
So it's a binary problem. In classification, you either have several categories to predict, or just binary classification, where you just have two categories to predict, usually zero or one. And so in the data set, you have a zero if the cancer doesn't have... Sorry, if the patient doesn't have cancer. And one if the patient has cancer. And so that's a binary classification problem. And we solve this with logistic regression.
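As a rough sketch of that binary classification setup, here is what it might look like with scikit-learn's bundled breast cancer dataset (which may or may not be the exact dataset used in the course):

```python
# Sketch: logistic regression on a public breast cancer dataset with a 0/1 target per patient.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)   # high max_iter since features are unscaled
print(clf.score(X_test, y_test))                                # accuracy on held-out patients
```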
00:39:17
Then another practical activity that we do, especially for linear regression, is the prediction of the electrical energy output in combined cycle power plants. And so same for this, we have the inputs, we always have the inputs. And in that case it is features like the ambient temperature, the wind velocity, the exhaust vacuum, and other features that you can measure with the sensors in combined cycle power plants. So we have all these features and we're trying to predict the electrical energy output. And we do that with a linear regression model. Because even though a linear regression model is very simple and is one of the most basic machine learning models, it still gives excellent results and people are still using it today in many case studies. So it is great to have a model that is not only simple, but also still widely useful today. So that's for the regression branch.
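The power plant case can be sketched the same way. The loading step below assumes a hypothetical local CSV copy of the UCI Combined Cycle Power Plant dataset, since the episode doesn't show the course's actual code:

```python
# Sketch: linear regression predicting energy output (PE) from ambient sensor readings.
# 'power_plant.csv' is a hypothetical local copy of the UCI Combined Cycle Power Plant data.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("power_plant.csv")        # assumed columns: AT, V, AP, RH, PE
X = df.drop(columns=["PE"])                # temperature, vacuum, pressure, humidity
y = df["PE"]                               # electrical energy output

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))         # R-squared on the held-out 20%
```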
00:40:21
And finally, while I've already talked a bit about this, but for the unsupervised learning branch of machine learning, which here is K-means clustering, well, we do this case study of customers in a mall for whom we have several features such as the age, the estimated salary, and how much they spend in the mall, according to a spending score from one to 100. And we have no idea what to predict. We don't have a dependent variable, because this is unsupervised learning. But we are trying to identify clusters of customers that have different attributes, different properties, different patterns of how they interact or how they act in the mall. And so that's what the K-means clustering model will identify.
00:41:12
So yes, that's the three practical activities we do in the course. And these are quite real-world ones, because I know that this is typically how you can use machine learning, whether you do supervised machine learning or unsupervised machine learning in the real world.
Kirill: 00:41:31
Could I just add a quick disclaimer just so people don't think we collected that data from the hospitals. It just sounded to me [inaudible 00:41:41].
Jon: 00:41:42
You didn't collect it, you stole it.
Kirill: 00:41:45
No, we used an existing data set, of course, for the tumor data sets.
Hadelin: 00:41:49
Yes. Yes. That's a very good point. There is this machine learning repository called the UCI Machine Learning Repository that gives data sets that you can use for free and that you just have to cite. So we always cite those data sets in our courses. But yes, these are usually old data sets. You can find some recent ones, but yes, the ones that we use are pretty old. And even from the 20th century.
Jon: 00:42:17
Yeah. I mean, the age of the data set for the purposes of teaching these concepts doesn't really matter all that much as long-
Hadelin: 00:42:24
No, yeah, not really.
Jon: 00:42:26
Yeah, as long as it's an illustrative example. And all three of these sound like great illustrative examples with breast cancer, energy output, and shopping behavior clustering. Super cool. All right. So we've learned some of the key concepts behind when a logistic regression model is behaving correctly. So, Kirill, you gave a detailed explanation of false positives and false negatives as when the model is incorrect. So can you tell us a bit more about what a logistic regression model returns? So in this cancer situation where the model is predicting whether there's a tumor there or not, does it just output a zero or a one, a zero if there's not a tumor and a one if there is?
Kirill: 00:43:19
Jon, I thought you would know this, you have a course, you teach this stuff. I'm joking.
Jon: 00:43:27
I've had a lot of head injuries, Kirill.
Kirill: 00:43:30
Yeah. So that was one of our other points that we wanted to share with our listeners, that these classification models don't typically return just a yes or no answer, or a zero or a one. They return a probability. A probability of a person having a tumor, or a person not having a tumor, a customer leaving your shop, or a customer staying with you, whatever other thing you're trying to classify, it returns probabilities rather than the exact output. And then you decide what to do with those probabilities. And that's, I think, an important consideration.
00:44:09
In most cases, what is done is, if the probability is 50% or more, then it's assigned a positive result. If the probability is less than 50%, then it's assigned a negative result. And that's how it's split somewhere in the middle. But you can change that. You can decide that maybe based on your business case or your application of the model or the domain knowledge that you have, you need a higher threshold for something to be a positive result. So you need it maybe to be a 70% prediction for it to be classified as a positive result.
00:44:47
Even in that case of tumors, you might decide that, well, for this model, for this specific application, we want the threshold to be higher so we have a lower chance of giving a positive result, because we'll check the rest of them down the line or something like that. But we really want it to be hard for the model to classify somebody as having a tumor. Or the opposite. You might want to have more of these, you want to set the threshold at 30% or 20% so that you get more of these results, so that you don't miss any tumors, and then you can double-check them with another model after that. So you have a bit of control there, actually a lot of control. And something to remember when you're going into classification problems.
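A small sketch of that threshold control, assuming a fitted scikit-learn classifier named clf and test features X_test (as in the earlier sketches):

```python
# Classification models expose probabilities; the 0/1 decision comes from a threshold you choose.
probs = clf.predict_proba(X_test)[:, 1]        # probability of the positive class, e.g. "has a tumor"

default_preds = (probs >= 0.5).astype(int)     # the usual 50% cut-off
cautious_preds = (probs >= 0.3).astype(int)    # lower threshold: flag more cases, fewer false negatives
strict_preds = (probs >= 0.7).astype(int)      # higher threshold: flag fewer cases, fewer false positives

print(default_preds.sum(), cautious_preds.sum(), strict_preds.sum())
```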
Jon: 00:45:35
Nice. Thank you for illuminating that for me, Kirill.
Kirill: 00:45:39
Anytime, Jon.
Jon: 00:45:42
No, that's super well said, crystal clear. All right. And then so once you get those probabilities, and I already alluded to this earlier, I ended up stepping on your toes. [inaudible 00:45:53]. But so when you're getting these probabilities out, and then you decide on some threshold, maybe with the cancer detection, you don't just want to use .5 as the threshold as to whether something is detected as a tumor or not, because you're like, "Oh, it's really bad if we have type two errors, so we'll lower that threshold."
00:46:16
So once you actually have a classification as to whether something is positive or negative, then we can do the kind of comparison that you were describing earlier with the real world to say whether something is a false positive, or a false negative, or true positive, true negative. So as we're considering doing that, I mentioned earlier that we might want to have some of the data kept on the side, so we know that our model wasn't trained on those data kept off on the side. So maybe you could fill us in a bit more on that process and what those different kinds of data sets are called.
Kirill: 00:46:54
Of course. Of course. So basically we're talking about splitting data into a training set and a test set. And it's an important consideration, especially for people starting out. One of the reasons is that you don't want to be finding out the false positives and false negatives later on, when you are applying your model in the real world and then you see, "Oh, well this person, they had cancer after all even though we told them they didn't. Oh, this person, we told them they did and they didn't have it." And people have to go through all this suffering and all the turmoil that causes in their lives just for you to figure out that your model has a 65% accuracy ratio. That's not great, because you want to know that in advance.
00:47:42
And learning about the accuracy ratio from your training data alone, so if you have this data that you train your model on and then you just look back and see, "Oh, how did it go?" Well, that's not exactly foolproof, because the model can be overfitted to your training data. And the model might be performing really, really well in your training, might have a 95% accuracy ratio, and you might think that everything is great, happy days. But it just might be the case that your model has learned how to cheat the system. It's looking at your data, adjusting everything in the right way just perfectly for it... Because that's ultimately what the machine learning algorithm needs to do. And especially these rudimentary machine learning algorithms, the foundational ones, they're not so smart that they're going to know exactly what you want, like ChatGPT for example. They will just try to do the best with the data that you give them.
00:48:45
So that's why, right away from the very start, you want to hide 20% of your data or maybe 30% of your data, it depends on the application. But typically we use 20% of the data. You want to hide it, not even show it to the model when you're creating it, create the model, train it, get it to... And then get it to a final stage. And then before you deploy it in the real world, that's when you take out your test data, the 20% that you set aside, and then you see how your model performs there. So effectively, you'll have two accuracy ratios. You'll have an accuracy ratio from your training data, which might be 95%, or you want to aim for a higher number there. But the real one, the one that you can confidently evaluate your model with, comes from your test data.
00:49:28
And as long as you collect your data as a true random sample, and as long as your data overall, the training and test sample, is representative of your population, of the bigger world of all your patients, or whatever you're predicting, then the results you get from the test data are going to be a good representation of what you will likely get in the real world.
00:49:51
So in your test data, you might get a smaller accuracy ratio, it might be 80%, but that's still already good, and that prepares you for the real world. But if in your test data you get an accuracy ratio of 55% as opposed to 95% in your training data, that'll tell you there's a red flag and you shouldn't deploy that model, you should look at it again and make adjustments, because it would be a dangerous thing to roll that model out.
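The hold-out workflow Kirill describes looks roughly like this in scikit-learn. The decision tree here is just an illustrative stand-in, chosen because it overfits the training data easily:

```python
# Compare training accuracy with held-out test accuracy to spot overfitting.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  # hide 20%

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # often near 1.0, the model has seen this data
print("test accuracy:", model.score(X_test, y_test))     # the number you should actually trust
```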
Jon: 00:50:15
Nice. Makes perfect sense. So the test set allows us to be confident that our model hasn't just memorized some specific aspects of the training data. And so this means that our model is then also more likely to be effective in the real world when it encounters data points it hasn't seen before.
Kirill: 00:50:35
And that doesn't mean that you should just roll out your model and leave it. You should always still maintain your model. That's something we don't talk about in the course, because it's a bit more advanced. But, hey, there's an insight on the podcast, for the podcast, extra one. So you want to maintain your models, you want to always check. The models deteriorate over time. Things change, populations change. If we're talking about tumors, diets change, exercises change, the climate changes, pollution levels change.
00:51:07
So your model will deteriorate over time with a year, two years, whatever. So they need to be monitored and you need to be always checking what is their accuracy ratio now, how many false positives are we getting? How many false negatives are we getting? And so on. In any industry. In finance, regulation might change, and overnight your model might go from 80% accuracy to 43% accuracy because customers are no longer allowed to do this, or banks are not allowed to do this, something, some activity. So yeah, models need to be trained well, but also monitored afterwards as well.
Jon: 00:51:40
For sure. Yeah. And events that can happen in the real world that cause models to become out of date especially quickly, like a global pandemic happens and real-world behaviors change, or a whole bunch of new language comes into the language of the world, like the word COVID. And so if you had some natural language model that was made before the pandemic and you never updated those model weights and it doesn't know what the word COVID is, the model might not perform very well anymore.
00:52:08
And so when the real world changes around our model, we can call that feature drift. There's lots of startups out there that create tools for monitoring for feature drift in production. Yeah. So for example, we've had Barr Moses on the show in episode number 499 talking about her company Monte Carlo, which is designed to monitor for feature drift in production. So definitely an important real-world problem that can occur out there.
00:52:42
So another topic that seems like it's really important here is there could be situations where our input variables... So you guys have said whether it's supervised learning or unsupervised learning, you always have inputs into the model. But what if those inputs are on completely different kinds of scales? So one of your inputs has a very narrow range of values and the other one has a very big range of values. Can that have a big negative impact on how your model trains?
Hadelin: 00:53:24
Well, yes and no, because that depends on the model that you are using. And the problem is not only about having different scales, because actually you might have all your inputs on the same scale, but you would still need to apply feature scaling. So let's give the example of a data set where the features are all between one and 10, they're all on the same scale, so there's not this problem of having different scales. Well, yet you would still have to apply feature scaling.
00:53:53
So feature scaling is not only about putting all the features on the same scale, it's about putting all the features on the same scale that is also a short range. And this short range can either be between zero and one, and that's normalization. So normalization is: you take your feature, you subtract the minimum of your feature, and you divide by the difference between the maximum and the minimum of your feature. And then you also have standardization, which will put all your features on the same scale, but not between zero and one, rather between minus three and plus three, or minus two and plus two. Because the formula of standardization is: you take your feature, you subtract the mean of your feature, and then you divide by the standard deviation of your feature. And that will put all the features on the same scale, which will be a short range. And that will usually improve the performance of your model.
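Hadelin's two formulas, written out as a minimal NumPy sketch (scikit-learn's MinMaxScaler and StandardScaler implement the same transformations):

```python
# Normalization: (x - min) / (max - min)            -> values in [0, 1]
# Standardization: (x - mean) / standard deviation  -> roughly between -3 and +3 for typical data
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

normalized = (x - x.min()) / (x.max() - x.min())
standardized = (x - x.mean()) / x.std()

print(normalized)      # [0.   0.25 0.5  0.75 1.  ]
print(standardized)
```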
00:54:46
But now coming back to your first question, when you asked me, "Do we have to apply feature scaling?" And I told you yes and no. Well, that depends indeed on the model. For some models you have to apply feature scaling, and I will mention which ones, and for others you don't. So the best way to understand this is to take, for example, linear regression. With linear regression, you don't have to apply feature scaling. And why is that? That's because the coefficients in linear regression can adapt to put all the products of coefficient times feature on the same scale. So if you have one feature on a very large scale, the coefficient associated with it, meaning the one that is multiplied by it, can be very small, to put that product of the coefficient and the feature on the same scale as the other products of coefficients and features.
00:55:38
So that's a good way to understand why feature scaling doesn't have to be applied for linear regression. But there is a rule of thumb in general, a way to remember more easily when to apply and when not to apply feature scaling. For example, you always have to apply feature scaling for gradient descent based algorithms, which include logistic regression and all the neural network based algorithms. So just a quick explanation of what gradient descent is. That's for supervised learning models where you have a dependent variable to predict. So first, your model will make some predictions. Then you compare your predictions to the ground truths, the labels, the real results, and that gives you the loss. And you compute the gradient of the loss with respect to the weights.
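As a rough illustration of that loop (predict, compare to the ground truth, compute the gradient of the loss, update the weights), here is a stripped-down gradient descent sketch for a one-feature linear model; the data and learning rate are arbitrary.

```python
import numpy as np

# Tiny made-up dataset: one feature, one continuous target
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 6.2, 7.9])

w, b = 0.0, 0.0   # model weights
lr = 0.01         # learning rate (arbitrary choice)

for _ in range(1000):
    y_pred = w * X + b               # 1) model makes predictions
    error = y_pred - y               # 2) compare predictions to the real results
    grad_w = 2 * np.mean(error * X)  # 3) gradient of the squared-error loss w.r.t. w
    grad_b = 2 * np.mean(error)      #    ...and w.r.t. b
    w -= lr * grad_w                 # 4) step the weights against the gradient
    b -= lr * grad_b

print(w, b)  # ends up close to the slope and intercept that fit the data
```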
00:56:30
And whenever you use that technique of reducing the loss with the gradient, well, you always have to apply feature scaling. And the other way around: the models for which you don't have to apply feature scaling are the tree-based ones, including the gradient boosting models, so all the models based on trees that make predictions as a team. For those models, you don't need to apply feature scaling. So these include, of course, decision tree regression or classification. These include random forest regression or classification, and also XGBoost, LightGBM, and CatBoost. So yes, for those models, you don't have to apply feature scaling.
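One way to put that rule of thumb into practice, sketched with scikit-learn: scale inside a pipeline for the gradient-descent-style model, and skip scaling for the tree ensemble. The bundled breast cancer dataset is just a convenient stand-in.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gradient-based model: feature scaling belongs in the pipeline
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)

# Tree-based ensemble: splits compare per-feature thresholds, so scaling isn't needed
forest = RandomForestClassifier(random_state=0)
forest.fit(X_train, y_train)

print(logreg.score(X_test, y_test), forest.score(X_test, y_test))
```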
Jon: 00:57:13
We joked earlier about me not knowing things, but that is something I did not know.
Kirill: 00:57:17
Wow.
Jon: 00:57:18
I did not know that there were some types of models, categories of models, ones that don't use gradient descent, where you don't need to worry about feature scaling. So there you go. Super cool. All right, let's see what else I have to learn here. It won't be the next topic, which I am familiar with. So you've talked already in this episode about using accuracy as a benchmark of model performance. And I think this is conceptually a very straightforward metric for evaluating a model: you just take the cases that the model got right, divide by the total number of cases, and you've got accuracy. And people, even laypeople, use accuracy all the time. So I think that makes a lot of sense to focus on for a Machine Learning Level 1 course, as well as in this episode so far. But I understand that you would like to fill us in on some slightly more sophisticated metrics.
Hadelin: 00:58:22
Yes, which is the R-squared. So indeed, for classification, evaluating the model is super intuitive to understand. The model makes predictions. If it's binary classification, zero or one. And the accuracy will simply be the number of correct predictions divided by the total number of observations in the test set. So that's fine. However, for regression, it's not that intuitive, simply because what you have to predict is a continuous numerical value. And therefore, there is no such thing as the number of correct predictions. Usually, the prediction will never be exactly correct.
00:59:01
So it's more like you look at how close the prediction is to the real result to assess whether that prediction is rather correct or incorrect. The closer it is to the real result, the more correct it is. And therefore, there is no such thing as accuracy for regression. Instead we have the R-squared. And so the R-squared is indeed an evaluation metric for regression models that takes values between zero and one. And the closer the R-squared is to one, the more accurate, although we should not say accurate, the better the predictions your regression model will give.
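Here are the two metrics side by side, roughly as described above, using scikit-learn's metric functions; the toy arrays are invented for illustration.

```python
from sklearn.metrics import accuracy_score, r2_score

# Classification: accuracy = correct predictions / total number of predictions
y_true_cls = [1, 0, 1, 1, 0]
y_pred_cls = [1, 0, 0, 1, 0]
print(accuracy_score(y_true_cls, y_pred_cls))   # 0.8

# Regression: no "correct" predictions, so R-squared measures how close
# the predictions are to the real values; closer to 1 is better
y_true_reg = [430.2, 455.7, 468.1, 480.9]
y_pred_reg = [433.0, 451.2, 470.5, 478.3]
print(r2_score(y_true_reg, y_pred_reg))
```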
Jon: 00:59:40
Cool. Makes perfect sense. And then I also know that there's something called adjusted R-squared. What's that about?
Kirill: 00:59:49
Love it, Jon. So adjusted R-squared is a metric that it's better to use to avoid a certain pitfall of multiple linear regression, where you can keep adding more independent variables and more coefficients. As you add more independent variables and coefficients, something happens to the normal R-squared. Without going into the math (of course, we explain this in a bit more detail in the course, along with the reasoning behind it), in an intuitive, high-level sense, why it's dangerous to use just the normal R-squared is that as you add more independent variables to your model, and more coefficients with them, your model can simply disregard independent variables that don't improve the model at all, while it accepts any independent variable that makes even a tiny improvement, and the formula for R-squared is such that it won't get worse.
01:01:13
The model has a trick. It can cheat the system and just zero out any new independent variables that would make the R-squared worse. And it can accept tons and tons of independent variables, especially if you have a problem where you have lots of features that you could be using, you don't know which ones to use, and you just keep adding them. Then, through random chance correlations, your R-squared will keep getting better and you'll end up with a model of hundreds of independent variables that actually don't mean anything, but because of the way that R-squared is structured, the model has accepted them.
01:01:45
So that's why adjusted R-squared was introduced. And the takeaway from this point that we're sharing on the podcast is that the difference between R-squared and adjusted R-squared is that adjusted R-squared penalizes your multiple linear regression model for having extra independent variables. And so that means that a variable has to actually make a good enough impact, has to add enough value to the model, greater than the penalty the model incurs, in order for that variable to be accepted. So it's a way of building models with a reasonable number of variables where they all actually add useful value to the model, rather than random chance minor [inaudible 01:02:37] correlations.
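The penalty Kirill describes is usually written as adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of independent variables. Here is a small helper that sketches the idea; treat it as an illustration rather than the course's exact derivation.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Penalize R-squared for the number of independent variables p,
    given n observations: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# The same R-squared of 0.90 looks much less impressive once the model
# uses 50 variables on only 100 observations
print(adjusted_r2(0.90, n=100, p=5))    # ~0.895
print(adjusted_r2(0.90, n=100, p=50))   # ~0.798
```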
Jon: 01:02:38
So to give some specific examples, so you're saying that R-squared is a useful metric for evaluating models, particularly where we don't have some categorical output? So with the cancer detection model, we have this...
Kirill: 01:02:56
Oh, yes, of course. Yes. Then you would use-
Jon: 01:02:59
Then you would use accuracy.
Kirill: 01:03:00
Accuracy.
Jon: 01:03:02
But you could actually use R-squared in those circumstances as well, because it can still be an interesting metric: it tells you the proportion of the variation explained in your outcome. An R-squared of .9 means that what you call the independent variables, the features, the inputs to your model, together explain 90% of the variation in your outcome, which, in classification, could be the cancer, is cancer present or not? Or with your energy output example, the outcome is how much energy the power plant is outputting, I guess.
Hadelin: 01:03:47
Yes. Electrical energy output. Yeah.
Jon: 01:03:49
Electrical energy output. But yeah, you make a really good point here. The adjusted R-squared is critical, because let's take that energy output example. If you had 200 possible inputs, independent variables, features that you could be putting into your model, and you just include all of them, then like you said, by random chance, you'll probably get a nearly perfect R-squared, because there's just so much opportunity for random, meaningless variation to appear meaningful. So the adjusted R-squared penalizes you for adding more inputs, more independent variables, to your model than you need.
Kirill: 01:04:28
That's very cool. Thanks, Jon. I learned something new today as well, about applying R-squared to classification problems. That's really cool.
Jon: 01:04:36
Hey, there you go. It's data science. There's an infinite amount out there.
Kirill: 01:04:39
Yeah.
Jon: 01:04:41
Yeah, you can never know everything.
Kirill: 01:04:43
Absolutely.
Jon: 01:04:45
Cool. All right. So we've talked about regression models a lot, whether we're talking about the binary classification model, where we're using logistic regression, or linear regression for predicting continuous outputs like how much energy is being output by the power plant. So are there particular kinds of circumstances where a novice data scientist might use a regression model, but it wasn't the right choice? Something about the real world where, even though they can apply regression, it will give wacky results?
Kirill: 01:05:35
Okay, yeah, sure. This is about the assumptions of linear regression, so this applies specifically to linear regression, not to logistic regression, which would be a classification type model. For linear regression, there are five main assumptions. And I love this. I spent a few days preparing this tutorial, and I loved it, because whenever you search online, you get different results. Some people say there are six or seven assumptions, some people say there are five, and they don't all list the same ones. So there's not one independent source of truth for this tutorial, and I couldn't find a really high quality video. So I am very excited about what we created here.
01:06:18
And basically, there are five assumptions of linear regression that I'll outline now. We'll go through them quite quickly, and we'll share a link in the show notes. We created a cheat sheet that people can download for these assumptions of linear regression. So people can go to the show notes and just get it there. It's a nice one-page PDF that you can download, keep, and even print.
01:06:40
So assumption number one is linearity. Basically, we're not going to go into how to do all of these things programmatically or how to check these assumptions mathematically. We'll be talking about eyeballing your data and just gauging from that. So when you look at your data, if you want to apply a multiple linear regression or a linear regression, you should see a generally linear relationship. If it looks like something very different, say an exponential relationship, or basically anything but a roughly linear chart, you probably shouldn't be applying a linear regression to it, because you'd get incorrect conclusions.
01:07:26
Now, assumption number two is homoscedasticity. Very cool word, but it basically means equal variance. So when you look at your data, imagine a slightly sloped trend line and a scatter plot where your data points are scattered along that sloped trend line. That's the data you're applying a linear regression to. You don't want to see that data in a cone shape. You don't want the spread between the data points and your presumed trend line to be increasing as you progress further to the right, or decreasing as you progress further to the right. If you see a cone shape, that means the variance depends on the independent variable. That's not good, that's not homoscedastic data, so you don't want to be applying linear regression to that.
01:08:22
Assumption number three is multivariate normality. Basically, if you have your assumed trend line, imagine you're looking along that line with your data points to the right of it and to the left of it; you want them to be normally distributed around this line. If you see something else, again, that wouldn't be a good candidate for linear regression.
01:08:52
Number four is independence of observations, which also includes what you might see called the no auto-correlation assumption in other sources. That basically means your data points should be independent of each other. You can't have a situation where one data point affects the next data point, which affects the next data point, and so on. The classic example of this is the stock market. In the stock market, the price right now affects the price in the next hour, which affects the price in the hour after that, and so on. So the observations are not independent, and you shouldn't be applying a linear regression to modeling the stock market. You probably won't get the best outcomes.
01:09:29
And the fifth and final assumption is lack of multicollinearity. You don't want your independent variables to be correlated with each other; you want them to be as uncorrelated as possible. If they are correlated, you can still run a linear regression, but the problem is that your coefficients will not be very reliable. We could go into detail on all that, but we won't. Basically, imagine having the same variable twice in your regression: the coefficients don't know what to do. Which one should be bigger, which one should be smaller? So the coefficients won't be very reliable, and you won't be able to interpret the impact of each variable from its coefficient.
01:10:14
There's one more, a sixth one, but it's not really an assumption, it's more of a check. You should always check for outliers. If your data set has outliers, you should decide for yourself whether you want to model it with the outliers or remove them before modeling. Some sources will include that as an assumption; it's not actually an assumption, it's more of a check. So yeah, once again, it's quite hard to visualize these things on a podcast and we ran through them quite quickly. There's going to be a PDF in the show notes; please go ahead and download it for your own use. We want people to be aware of these assumptions before they use linear regression.
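A few eyeball-style checks for these assumptions can be sketched in Python; the data, model, and plots below are placeholders for whatever you are actually fitting.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: one feature, one continuous target
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=200)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Linearity / homoscedasticity: residuals vs. predictions should look like a
# flat, even band; a curve suggests non-linearity, a cone suggests variance
# that grows with the prediction
plt.scatter(model.predict(X), residuals, s=10)
plt.axhline(0, color="red")
plt.xlabel("predicted value")
plt.ylabel("residual")
plt.show()

# Normality of the residuals: the histogram should look roughly bell-shaped
plt.hist(residuals, bins=20)
plt.show()

# Multicollinearity: with several features, inspect pairwise correlations
# (or variance inflation factors) and drop near-duplicate variables
# corr = np.corrcoef(X, rowvar=False)
```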
Jon: 01:10:51
Nice. Super cool rundown. And that PDF will definitely be helpful. I was able to follow what you were saying there, but maybe if it was the first time I'd ever learned those, it would be tricky. So I'm sure the PDF will come in handy for many of our listeners out there. So in your course you include a lot of topics like that one, assumptions of linear regression. You get deep into the weeds on some topics that a lot of other beginner courses would just ignore. So, for example, you guys cover things like K-means++, adjusted R-squared, which we already talked about, the confusion matrix, which we alluded to a little bit earlier in the episode, and things like ordinary least squares. So lots of relatively complex topics that beginner courses would usually skip, but you guys didn't. Why not? And maybe you could let us know about a specific method called the elbow method, which I had not heard of before, truly.
Hadelin: 01:11:48
Okay. So indeed, yes. We basically do not skip the concepts that are very important in the machine learning pipeline, this whole process from start to finish where you first pre-process your data, up to making predictions and then deploying your model. Each of the tools used during this process, we cover them and we explain them. And it's true that for unsupervised learning, and especially clustering, we use the elbow method to figure out the optimal number of clusters. Because when you do unsupervised learning with clustering, not only do you not know what to predict, but you also don't know the optimal number of clusters that you want to identify. So if we take, again, this data set of customers in the mall, sure, you know that you want to identify clusters of customers, but you don't know how many. And there is an optimal number that will lead to a great result.
01:12:43
And the technique that will help you figure out this right number, this optimal number, is the elbow method. And why is it called the elbow method? Because it's actually code that will plot a graph, the graph of a curve that looks like an elbow. And the optimal number of clusters will be found at the elbow; you just project the elbow onto the X-axis, because on the X-axis you have the different numbers of clusters that you experiment with. And on the Y-axis you have what we call the within-cluster sum of squares. So the within-cluster sum of squares is simply the sum of the squared distances between the observation points, meaning the customers, and the centroids of their clusters.
01:13:30
So I'll give you a simple example. If, for example, we have 1,000 customers in the mall and we have 1,000 clusters, then the within-cluster sum of squares will be zero, because each customer is its own centroid. So each squared distance is zero and the sum is zero. However, if you have one cluster, then you have one centroid, and you will have a huge within-cluster sum of squares, because you will have lots of non-zero distances from each customer to the centroid. And so you see, it's a curve that starts high, because when you have a low number of clusters, you have a big within-cluster sum of squares, and then it decreases as the number of clusters you experiment with increases. And at some point, around the middle of the curve, you have the elbow, and that's where you find the optimal number of clusters. And that's the elbow method helping you figure this out.
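The elbow method is commonly coded with scikit-learn's KMeans, whose inertia_ attribute is the within-cluster sum of squares Hadelin mentions; the mall-customer data below is simulated for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Simulated stand-in for mall-customer features (e.g. income, spending score)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 1.0, size=(100, 2)) for loc in (0, 5, 10)])

# Fit K-means for a range of cluster counts and record the
# within-cluster sum of squares (scikit-learn calls it inertia_)
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)

# The optimal k sits at the "elbow", where the curve stops dropping sharply
plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("number of clusters")
plt.ylabel("within-cluster sum of squares")
plt.show()
```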
Jon: 01:14:32
Nice. Super cool. I'm glad to have learned about it. Probably the reason why I hadn't heard of it is that I have not actually done that much clustering in my career. So great. Lots for me to learn in the Machine Learning in Python Level 1 course. So speaking of which, the Level 1 implies to me that there's at least one sequel planned. You guys got something in the works? What's going on over there?
Hadelin: 01:14:59
Yes, we do. We do. So actually our original idea was to make a series. Instead of making just one course, we wanted to make a series, kind of like in cinema. And it's a series in three levels: Machine Learning Level 1, Machine Learning Level 2, and Machine Learning Level 3. And so indeed, in Machine Learning Level 1, we cover all the foundations of machine learning. Then in Machine Learning Level 2, we implement and learn about more advanced models, for different case studies and different applications of machine learning. And in Level 3, we cover even more advanced models for more specialized applications. So these are all the deep learning models, neural networks applied to, for example, computer vision, object recognition, and other, let's say, less standard applications. So yes, Machine Learning Level 2 is in the pipeline. Yes.
Jon: 01:16:00
Super cool. So yeah, more advanced models coming up soon, no doubt. Very exciting. And so there will probably be more SuperDataScience Podcast episodes featuring some high-level summaries of those topics. I love that we did this episode. Certainly since I've been hosting the show, I don't think we've had an episode like this that was really an intro to machine learning, let's cover all the topics. Kirill, had you done something like this before? [inaudible 01:16:28].
Kirill: 01:16:28
No, I don't think so. I don't think so. It's a first.
Jon: 01:16:30
Yeah. Super cool. I'm sure there are lots of audience members out there who will love it. Prior to this episode, as I mentioned at the beginning, I posted on social media that you guys would be on the air, and we got tons of engagement from people. We have time for one question, from Lalit Ravi Shankar Tangirala. He's a data management analyst, and he says, "Kirill, Hadelin, do you think in future, technologies like ChatGPT will replace AI/ML jobs?"
Kirill: 01:17:08
There's a cool book, which I think I read in 2013, '14, maybe '15. It's called Will Humans Go The Way of Horses? And back in the day, in the 19th century, there were horses everywhere. No cars, right? [inaudible 01:17:26] cars were just starting out and so on, and people were worried about the problems related to having horses around, things like that. And then these cars started appearing and people were like, "What will happen?" They never really believed in these cars. Horses will stay. And then, bam, all of a sudden cars are everywhere and we don't see horses in cities anymore.
01:17:48
So the question is basically, will humans go the same way as horses did? And I don't believe so. I think humans are very adaptable. If you look at 100 years ago, I'm probably misquoting these stats, but approximately 90% of the US population was in agriculture. Now it's, what is it, 5%, right? But have people become unhappy? Have people become redundant, as in lost their jobs with nothing to do? No, we have a higher population than ever and everybody has a job. Well, most people have jobs. Most people have found ways to apply themselves and to rediscover, reinvent themselves. So I think the better way of looking at it is, it's inevitable that it's going to impact our jobs, but we can use it as a tool rather than see it as a threat.
Jon: 01:18:37
Great answer. Nice. All right, so it's been so much fun catching up with both of you on air. I've loved this episode. After this episode ends, and sadly all good things must come to an end, how can our listeners keep track of what you guys are up to?
Kirill: 01:18:58
Yeah, I guess it's a good question. I didn't even think of it. Well, first of all, you can find the course at superdatascience.com/start if you're interested in checking it out. So that's superdatascience.com/start. And other than that, Hadelin and I each have a book; you can find those. You can follow us on, mostly, LinkedIn, right? Hadelin, I would say LinkedIn is the best way to find us out of all the social media?
Hadelin: 01:19:23
Yes. Yes. For me it's LinkedIn or Instagram. Yes.
Kirill: 01:19:27
Oh, yes. Hadelin, Instagram as well. So yeah, that's that. And yeah, we'll be in touch. We've got some exciting things that we're working on and I'm sure you'll hear from us again soon.
Jon: 01:19:38
Sweet. Can't wait. Thanks so much, gentlemen, and catch you on the SuperDataScience Podcast again sometime soon.
Kirill: 01:19:45
Thanks, Jon.
Hadelin: 01:19:47
Thank you, Jon.
Jon: 01:19:53
It's always a blast to be hanging out with Kirill and Hadelin. Hope you had fun too and learned a lot. I certainly did. In today's episode, Kirill and Hadelin filled us in on how supervised learning requires labeled data, while unsupervised learning can proceed without it. They also talked about how false positives are so-called type one classification errors, wherein, say, someone without cancer is flagged as having it. In contrast, false negatives are often serious type two classification errors wherein, say, a patient has cancer, but the machine learning model outputs that they're healthy.
01:20:28
They also talked about how having a held-out set of test data enables us to ensure that our ML model hasn't simply memorized specific unique characteristics of our training data, how the R-squared metric allows us to evaluate models with a continuous outcome, like a regression model, and how the elbow method allows us to find the optimal number of clusters in data with an unsupervised machine learning approach.
01:20:50
As always, you can get all the show notes including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Kirill and Hadelin's social media profiles, as well as my own social media profiles at superdatascience.com/649. That's superdatascience.com/649. If you too would like to ask questions of future guests of the show, like several audience members did during today's episode, then consider following me on LinkedIn or Twitter, because that's where I post who upcoming guests are and ask you to provide your inquiries for them.
01:21:23
All right, thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you. And thanks, of course, to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and yes, Kirill himself, on the SuperDataScience team for producing another super educational episode for us today.
01:21:40
For enabling that super team to create this free podcast for you, we are deeply grateful to our sponsors, whom I've hand selected as partners because I expect their products to be genuinely of interest to you. Please consider supporting this free show by checking out our sponsors' links, which you can find in the show notes. And if you yourself are interested in sponsoring an episode, you can get the details on how by making your way to jonkrohn.com/podcast.
01:22:03
Last but not least, thanks to you for listening. We would not be here at all without you. So until next time, my friend, keep on rocking it out there and I'm looking forward to enjoying another round of the SuperDataScience Podcast with you very soon.